Team 6 — Milestone 2¶
Members: John Holik, Claiton Pinto, Marina Bunyatova
Dataset: credit_risk_dataset.csv
Source: https://www.kaggle.com/datasets/laotse/credit-risk-dataset
Target: loan_status (binary: 0/1)
0. Dataset Context (From Initial Inspection)¶
- Rows × Cols: 32,581 × 12
- Target Distribution (loan_status): 0 ≈ 78.18%, 1 ≈ 21.82%
- Missingness (major): loan_int_rate ≈ 9.56%, person_emp_length ≈ 2.75%
- Duplicates: 165 exact duplicate rows detected — plan: drop before modeling (document this step).
- Quick outliers: person_age > 100 (5 rows), person_emp_length > 50 (2 rows) — plan: handle via capping / rejection rules (justify).
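The plan above names two options, capping and rejection rules; the capping path is implemented later, so here is a minimal sketch of the rejection-rule alternative (the function name reject_implausible is illustrative, not part of the pipeline):

```python
import pandas as pd

def reject_implausible(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that violate hard plausibility limits instead of capping them.

    Missing person_emp_length is treated as 0 so NaN rows are kept, not rejected.
    """
    mask = (df["person_age"] <= 100) & (df["person_emp_length"].fillna(0) <= 50)
    return df[mask]
```

Rejection shrinks the sample (7 rows here) but avoids piling capped values onto the boundary; either choice should be justified in the write-up.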
1. Environment & Reproducibility¶
Import Required Libraries
# ========== Standard library ==========
import os
import sys
import json
import random
import warnings
from pathlib import Path
from datetime import datetime
# ========== Core data science ==========
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import scipy.stats as stats
# ========== Plotting ==========
import matplotlib.pyplot as plt
import seaborn as sns
# ========== Pipeline & preprocessing ==========
from sklearn.model_selection import (
train_test_split,
StratifiedKFold,
GridSearchCV,
RandomizedSearchCV,
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
# ========== Metrics & evaluation ==========
from sklearn.metrics import (
roc_auc_score,
roc_curve,
average_precision_score,
precision_recall_curve,
f1_score,
accuracy_score,
balanced_accuracy_score,
confusion_matrix,
classification_report,
brier_score_loss,
)
# ========== Calibration ==========
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
# ========== Interpretation ==========
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
# ========== Models ==========
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import (
BaggingClassifier,
RandomForestClassifier,
AdaBoostClassifier,
GradientBoostingClassifier,
)
from sklearn.svm import SVC, LinearSVC
from sklearn.neural_network import MLPClassifier
# ========== Persistence ==========
from joblib import dump, load
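The section is titled Environment & Reproducibility, but no global seeds are set; a minimal sketch (the constant name RANDOM_STATE matches the one used later for the train/test split):

```python
import os
import random

import numpy as np

RANDOM_STATE = 42  # single seed reused for splitting, models, and sampling

def set_global_seeds(seed: int = RANDOM_STATE) -> None:
    """Seed Python's random module, NumPy, and hashing for reproducible reruns."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seeds()
```

Calling this once at the top of the notebook makes any unseeded `np.random` or `random` draws below repeatable across kernel restarts.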
2. Data Loading & Auditing¶
- Load /mnt/data/credit_risk_dataset.csv.
- Basic audit: shape, dtypes, preview, memory usage.
- Duplicates: detect & drop (165 expected; verify).
- Missingness report: counts & percentages per column (focus on loan_int_rate, person_emp_length).
- Target audit: class balance & positive class definition.
- Persist a data dictionary (CSV) with types, n_unique, missingness, example values.
df = pd.read_csv('credit_risk_dataset.csv')
print("======= Dataset Audit =======")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("\nColumn Data Types:")
print(df.dtypes)
print("\nMissing Values per Column:")
print(df.isnull().sum())
print("\nDuplicate Rows:")
print(df.duplicated().sum())
print("\nStatistical Summary:")
print(df.describe(include='all'))
======= Dataset Audit =======
Number of rows: 32581
Number of columns: 12
Column Data Types:
person_age int64
person_income int64
person_home_ownership object
person_emp_length float64
loan_intent object
loan_grade object
loan_amnt int64
loan_int_rate float64
loan_status int64
loan_percent_income float64
cb_person_default_on_file object
cb_person_cred_hist_length int64
dtype: object
Missing Values per Column:
person_age 0
person_income 0
person_home_ownership 0
person_emp_length 895
loan_intent 0
loan_grade 0
loan_amnt 0
loan_int_rate 3116
loan_status 0
loan_percent_income 0
cb_person_default_on_file 0
cb_person_cred_hist_length 0
dtype: int64
Duplicate Rows:
165
Statistical Summary:
person_age person_income person_home_ownership person_emp_length \
count 32581.000000 3.258100e+04 32581 31686.000000
unique NaN NaN 4 NaN
top NaN NaN RENT NaN
freq NaN NaN 16446 NaN
mean 27.734600 6.607485e+04 NaN 4.789686
std 6.348078 6.198312e+04 NaN 4.142630
min 20.000000 4.000000e+03 NaN 0.000000
25% 23.000000 3.850000e+04 NaN 2.000000
50% 26.000000 5.500000e+04 NaN 4.000000
75% 30.000000 7.920000e+04 NaN 7.000000
max 144.000000 6.000000e+06 NaN 123.000000
loan_intent loan_grade loan_amnt loan_int_rate loan_status \
count 32581 32581 32581.000000 29465.000000 32581.000000
unique 6 7 NaN NaN NaN
top EDUCATION A NaN NaN NaN
freq 6453 10777 NaN NaN NaN
mean NaN NaN 9589.371106 11.011695 0.218164
std NaN NaN 6322.086646 3.240459 0.413006
min NaN NaN 500.000000 5.420000 0.000000
25% NaN NaN 5000.000000 7.900000 0.000000
50% NaN NaN 8000.000000 10.990000 0.000000
75% NaN NaN 12200.000000 13.470000 0.000000
max NaN NaN 35000.000000 23.220000 1.000000
loan_percent_income cb_person_default_on_file \
count 32581.000000 32581
unique NaN 2
top NaN N
freq NaN 26836
mean 0.170203 NaN
std 0.106782 NaN
min 0.000000 NaN
25% 0.090000 NaN
50% 0.150000 NaN
75% 0.230000 NaN
max 0.830000 NaN
cb_person_cred_hist_length
count 32581.000000
unique NaN
top NaN
freq NaN
mean 5.804211
std 4.055001
min 2.000000
25% 3.000000
50% 4.000000
75% 8.000000
max 30.000000
3. Exploratory Data Analysis (EDA)¶
- Univariate distributions (numeric histograms, categorical bar plots).
- Target vs feature relationships (box/violin for numeric; stacked bars for categorical).
- Correlations (numeric Pearson/Spearman) and multicollinearity scan.
- Detect potential data leakage.
# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")
print("======= EXPLORATORY DATA ANALYSIS =======\n")
# Create figure directory if it doesn't exist
os.makedirs('Output', exist_ok=True)
# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric columns: {numeric_cols}")
print(f"Categorical columns: {categorical_cols}")
# ========== 1. UNIVARIATE DISTRIBUTIONS ==========
print("\n1. UNIVARIATE DISTRIBUTIONS")
print("="*50)
# Numeric distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()
for i, col in enumerate(numeric_cols):
if i < len(axes):
df[col].hist(bins=30, ax=axes[i], alpha=0.7, edgecolor='black')
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Frequency')
# Remove empty subplots
for j in range(len(numeric_cols), len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.savefig('Output/numeric_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
# Categorical distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()
for i, col in enumerate(categorical_cols):
if i < len(axes):
value_counts = df[col].value_counts()
value_counts.plot(kind='bar', ax=axes[i], alpha=0.7)
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Count')
axes[i].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('Output/categorical_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
# ========== 2. TARGET VS FEATURE RELATIONSHIPS ==========
print("\n2. TARGET vs FEATURE RELATIONSHIPS")
print("="*50)
# Target distribution
print("Target distribution:")
target_dist = df['loan_status'].value_counts()
print(target_dist)
print(f"Percentage: {target_dist / len(df) * 100}")
# Numeric features vs target (box plots)
numeric_features = [col for col in numeric_cols if col != 'loan_status']
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()
for i, col in enumerate(numeric_features):
if i < len(axes):
df.boxplot(column=col, by='loan_status', ax=axes[i])
axes[i].set_title(f'{col} by Loan Status')
axes[i].set_xlabel('Loan Status')
axes[i].set_ylabel(col)
# Remove empty subplots
for j in range(len(numeric_features), len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.savefig('Output/numeric_vs_target_boxplots.png', dpi=300, bbox_inches='tight')
plt.show()
# Categorical features vs target (stacked bars)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()
for i, col in enumerate(categorical_cols):
if i < len(axes):
crosstab = pd.crosstab(df[col], df['loan_status'], normalize='index')
crosstab.plot(kind='bar', stacked=True, ax=axes[i], alpha=0.7)
axes[i].set_title(f'{col} vs Loan Status (Normalized)')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Proportion')
axes[i].legend(['No Default', 'Default'])
axes[i].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('Output/categorical_vs_target_stacked.png', dpi=300, bbox_inches='tight')
plt.show()
# ========== 3. CORRELATIONS AND MULTICOLLINEARITY ==========
print("\n3. CORRELATIONS AND MULTICOLLINEARITY")
print("="*50)
# Pearson correlation for numeric features
numeric_df = df[numeric_cols]
pearson_corr = numeric_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('Pearson Correlation Matrix')
plt.tight_layout()
plt.savefig('Output/pearson_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
# Spearman correlation
spearman_corr = numeric_df.corr(method='spearman')
plt.figure(figsize=(10, 8))
sns.heatmap(spearman_corr, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('Spearman Correlation Matrix')
plt.tight_layout()
plt.savefig('Output/spearman_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
# High correlation pairs (potential multicollinearity)
print("High correlation pairs (|r| > 0.7):")
high_corr_pairs = []
for i in range(len(pearson_corr.columns)):
for j in range(i+1, len(pearson_corr.columns)):
corr_val = abs(pearson_corr.iloc[i, j])
if corr_val > 0.7:
high_corr_pairs.append((pearson_corr.columns[i], pearson_corr.columns[j], corr_val))
print(f"{pearson_corr.columns[i]} - {pearson_corr.columns[j]}: {corr_val:.3f}")
if not high_corr_pairs:
print("No high correlations detected (threshold: 0.7)")
# ========== 4. DATA LEAKAGE DETECTION ==========
print("\n4. DATA LEAKAGE DETECTION")
print("="*50)
# Check for perfect or near-perfect correlations with target
target_corr = numeric_df.corr()['loan_status'].abs().sort_values(ascending=False)
print("Correlations with target (loan_status):")
print(target_corr)
# Flag suspiciously high correlations
suspicious_features = target_corr[(target_corr > 0.8) & (target_corr < 1.0)]
if len(suspicious_features) > 0:
print(f"\nPOTENTIAL LEAKAGE: Features with suspiciously high correlation with target:")
for feature, corr in suspicious_features.items():
print(f" - {feature}: {corr:.3f}")
else:
print("\nNo obvious leakage detected based on correlations")
# Check categorical associations with target using Chi-square
print("\nCategorical feature associations with target (Chi-square p-values):")
for col in categorical_cols:
contingency_table = pd.crosstab(df[col], df['loan_status'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"{col}: p-value = {p_value:.6f} {'(significant)' if p_value < 0.05 else '(not significant)'}")
# Summary statistics by target
print("\nSummary statistics by loan_status:")
print(df.groupby('loan_status')[numeric_features].mean())
print("\n======= EDA COMPLETE =======")
print("Figures saved to 'Output/' directory")
======= EXPLORATORY DATA ANALYSIS =======

Numeric columns: ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length']
Categorical columns: ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']

1. UNIVARIATE DISTRIBUTIONS
==================================================

2. TARGET vs FEATURE RELATIONSHIPS
==================================================
Target distribution:
loan_status
0    25473
1     7108
Name: count, dtype: int64
Percentage: loan_status
0    78.183604
1    21.816396
Name: count, dtype: float64
3. CORRELATIONS AND MULTICOLLINEARITY ==================================================
High correlation pairs (|r| > 0.7):
person_age - cb_person_cred_hist_length: 0.859
4. DATA LEAKAGE DETECTION
==================================================
Correlations with target (loan_status):
loan_status 1.000000
loan_percent_income 0.379366
loan_int_rate 0.335133
person_income 0.144449
loan_amnt 0.105376
person_emp_length 0.082489
person_age 0.021629
cb_person_cred_hist_length 0.015529
Name: loan_status, dtype: float64
No obvious leakage detected based on correlations
Categorical feature associations with target (Chi-square p-values):
person_home_ownership: p-value = 0.000000 (significant)
loan_intent: p-value = 0.000000 (significant)
loan_grade: p-value = 0.000000 (significant)
cb_person_default_on_file: p-value = 0.000000 (significant)
Summary statistics by loan_status:
person_age person_income person_emp_length loan_amnt \
loan_status
0 27.807129 70804.361559 4.968745 9237.464178
1 27.474676 49125.652223 4.137562 10850.502954
loan_int_rate loan_percent_income cb_person_cred_hist_length
loan_status
0 10.435999 0.148805 5.837475
1 13.060207 0.246889 5.685003
======= EDA COMPLETE =======
Figures saved to 'Output/' directory
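The EDA plan calls for a multicollinearity scan, and the correlation pass above already flags person_age vs cb_person_cred_hist_length at 0.859; a common follow-up is the variance inflation factor (VIF). A minimal pure-NumPy sketch (the helper name vif_table is not part of the notebook):

```python
import numpy as np
import pandas as pd

def vif_table(df_num: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing each column on the rest."""
    X = df_num.dropna().to_numpy(dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    X = (X - X.mean(axis=0)) / std
    vifs = {}
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)  # OLS fit on the rest
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs[df_num.columns[i]] = 1.0 / max(1.0 - r2, 1e-12)
    return pd.Series(vifs).sort_values(ascending=False)
```

A common rule of thumb flags VIF > 5 (or > 10) as problematic; given the 0.859 correlation, person_age and cb_person_cred_hist_length would likely both score high here.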
4. Data Preprocessing Pipeline¶
- Remove exact duplicate rows (165 found in the audit)
- Cap domain-based outliers: person_age > 100 and person_emp_length > 50
- Separate features from the target (loan_status)
- Stratified 80/20 train-test split with random_state=42
- Build a ColumnTransformer: median imputation + standardization for numeric features; most-frequent imputation + one-hot encoding for categorical features
- Fit the preprocessor on the training split only (no leakage), then transform train and test
- Persist the fitted preprocessor (joblib) and a pre-processing data dictionary (CSV)
- Visualize the transformed data: feature distributions, 2D PCA projection, correlation heatmap
# ========== 4. DATA PREPROCESSING PIPELINE ==========
print("======= DATA PREPROCESSING PIPELINE =======\n")
# Create artifacts directory
os.makedirs('artifacts', exist_ok=True)
os.makedirs('models', exist_ok=True)
# ========== 1. HANDLE DUPLICATES ==========
print("1. HANDLING DUPLICATES")
print("="*50)
n_duplicates = df.duplicated().sum()
print(f"Duplicate rows found: {n_duplicates}")
if n_duplicates > 0:
df_clean = df.drop_duplicates()
print(f"Rows after removing duplicates: {df_clean.shape[0]} (removed {n_duplicates} rows)")
else:
df_clean = df.copy()
print("No duplicates to remove.")
# ========== 2. HANDLE OUTLIERS ==========
print("\n2. HANDLING OUTLIERS")
print("="*50)
# Domain-based outlier capping
outlier_rules = {
'person_age': 100,
'person_emp_length': 50
}
print("Applying domain-based outlier caps:")
for col, cap in outlier_rules.items():
n_outliers = (df_clean[col] > cap).sum()
if n_outliers > 0:
print(f" - {col}: capping {n_outliers} values > {cap}")
df_clean.loc[df_clean[col] > cap, col] = cap
else:
print(f" - {col}: no outliers detected (threshold: {cap})")
# ========== 3. SEPARATE FEATURES AND TARGET ==========
print("\n3. SEPARATING FEATURES AND TARGET")
print("="*50)
X = df_clean.drop('loan_status', axis=1)
y = df_clean['loan_status']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Target distribution:\n{y.value_counts()}")
print(f"Target balance: {y.value_counts(normalize=True).to_dict()}")
# ========== 4. TRAIN-TEST SPLIT ==========
print("\n4. TRAIN-TEST SPLIT")
print("="*50)
# 80-20 split with stratification
RANDOM_STATE = 42
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=TEST_SIZE,
random_state=RANDOM_STATE,
stratify=y
)
print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Train target distribution:\n{y_train.value_counts()}")
print(f"Test target distribution:\n{y_test.value_counts()}")
# ========== 5. BUILD PREPROCESSING PIPELINE ==========
print("\n5. BUILDING PREPROCESSING PIPELINE")
print("="*50)
# Identify numeric and categorical columns (excluding target)
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
# Numeric pipeline: impute with median, then standardize
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical pipeline: impute with most frequent, then one-hot encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine into ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
],
remainder='drop' # Drop any columns not specified
)
print("\nPreprocessor defined with:")
print(f" - Numeric: median imputation + standard scaling")
print(f" - Categorical: most-frequent imputation + one-hot encoding")
# ========== 6. FIT PREPROCESSOR ON TRAINING DATA ==========
print("\n6. FITTING PREPROCESSOR (train data only)")
print("="*50)
preprocessor.fit(X_train)
print(" Preprocessor fitted on training data")
# Transform both train and test
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
print(f"Transformed train shape: {X_train_transformed.shape}")
print(f"Transformed test shape: {X_test_transformed.shape}")
# Get feature names after transformation
try:
feature_names_out = preprocessor.get_feature_names_out()
print(f"Total features after preprocessing: {len(feature_names_out)}")
except AttributeError:
feature_names_out = None
print("Note: Feature names extraction not available for this sklearn version")
# ========== 7. PERSIST PREPROCESSOR ==========
print("\n7. PERSISTING PREPROCESSOR")
print("="*50)
preprocessor_path = 'models/preprocessor.joblib'
dump(preprocessor, preprocessor_path)
print(f" Preprocessor saved to: {preprocessor_path}")
# ========== 8. SAVE PRE-PROCESSING DATA DICTIONARY ==========
print("\n8. SAVING PRE-PROCESSING DATA DICTIONARY")
print("="*50)
data_dict = []
for col in X.columns:
data_dict.append({
'feature': col,
'dtype': str(X[col].dtype),
'n_missing': X[col].isnull().sum(),
'pct_missing': f"{X[col].isnull().sum() / len(X) * 100:.2f}%",
'n_unique': X[col].nunique(),
'sample_values': str(X[col].dropna().head(3).tolist())
})
data_dict_df = pd.DataFrame(data_dict)
dict_path = 'artifacts/data_dictionary_pre_preprocessing.csv'
data_dict_df.to_csv(dict_path, index=False)
print(f" Data dictionary saved to: {dict_path}")
print(data_dict_df)
# ========== 9. SUMMARY REPORT ==========
print("\n9. PREPROCESSING SUMMARY")
print("="*50)
print(f" Duplicates removed: {n_duplicates}")
# Count capped values against the original df (df_clean is already capped here)
print(f" Outliers capped: {sum(int((df[col] > cap).sum()) for col, cap in outlier_rules.items())}")
print(f" Train/test split: {1-TEST_SIZE:.0%}/{TEST_SIZE:.0%} (stratified)")
print(f" Missing values handled via imputation")
print(f" Categorical features one-hot encoded")
print(f" Numeric features standardized")
print(f" Preprocessor persisted and ready for modeling")
# ========== 10. VISUALIZE TRANSFORMED DATA ==========
print("\n10. VISUALIZING TRANSFORMED DATA")
print("="*50)
# Create sample of transformed data for visualization
n_samples_to_plot = min(1000, X_train_transformed.shape[0])
rng = np.random.default_rng(RANDOM_STATE)  # seeded so the plotted sample is reproducible
sample_indices = rng.choice(X_train_transformed.shape[0], n_samples_to_plot, replace=False)
X_sample = X_train_transformed[sample_indices]
y_sample = y_train.iloc[sample_indices]
# Plot 1: Distribution of first few transformed features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Distribution of Transformed Features (Sample)', fontsize=16, fontweight='bold')
for idx, ax in enumerate(axes.flat):
if idx < X_train_transformed.shape[1]:
ax.hist(X_train_transformed[:, idx], bins=50, alpha=0.7, edgecolor='black')
feature_name = feature_names_out[idx] if feature_names_out is not None else f'Feature {idx}'
ax.set_title(f'{feature_name}', fontsize=10)
ax.set_xlabel('Standardized Value')
ax.set_ylabel('Frequency')
ax.grid(True, alpha=0.3)
else:
ax.axis('off')
plt.tight_layout()
plt.savefig('artifacts/transformed_features_distribution.png', dpi=100, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/transformed_features_distribution.png")
# Plot 2: Class distribution in transformed space (PCA for visualization)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_sample)
fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y_sample, cmap='RdYlGn_r',
alpha=0.6, edgecolors='black', linewidth=0.5)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)', fontsize=12)
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)', fontsize=12)
ax.set_title('Transformed Data in 2D (PCA Projection)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Loan Status (0=Good, 1=Default)', fontsize=11)
plt.tight_layout()
plt.savefig('artifacts/transformed_data_pca.png', dpi=100, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/transformed_data_pca.png")
# Plot 3: Feature correlation heatmap (subset of features)
n_features_to_show = min(15, X_train_transformed.shape[1])
X_subset = X_train_transformed[:, :n_features_to_show]
feature_subset_names = feature_names_out[:n_features_to_show] if feature_names_out is not None else [f'F{i}' for i in range(n_features_to_show)]
corr_matrix = np.corrcoef(X_subset.T)
fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1, aspect='auto')
ax.set_xticks(np.arange(len(feature_subset_names)))
ax.set_yticks(np.arange(len(feature_subset_names)))
ax.set_xticklabels(feature_subset_names, rotation=45, ha='right', fontsize=9)
ax.set_yticklabels(feature_subset_names, fontsize=9)
ax.set_title('Correlation Matrix of Transformed Features (Subset)', fontsize=14, fontweight='bold', pad=20)
plt.colorbar(im, ax=ax, label='Correlation Coefficient')
plt.tight_layout()
plt.savefig('artifacts/transformed_features_correlation.png', dpi=100, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/transformed_features_correlation.png")
print("\n======= PREPROCESSING COMPLETE =======")
======= DATA PREPROCESSING PIPELINE =======
1. HANDLING DUPLICATES
==================================================
Duplicate rows found: 165
Rows after removing duplicates: 32416 (removed 165 rows)
2. HANDLING OUTLIERS
==================================================
Applying domain-based outlier caps:
- person_age: capping 5 values > 100
- person_emp_length: capping 2 values > 50
3. SEPARATING FEATURES AND TARGET
==================================================
Features shape: (32416, 11)
Target shape: (32416,)
Target distribution:
loan_status
0 25327
1 7089
Name: count, dtype: int64
Target balance: {0: 0.7813116979269497, 1: 0.21868830207305034}
4. TRAIN-TEST SPLIT
==================================================
Train set: 25932 samples
Test set: 6484 samples
Train target distribution:
loan_status
0 20261
1 5671
Name: count, dtype: int64
Test target distribution:
loan_status
0 5066
1 1418
Name: count, dtype: int64
5. BUILDING PREPROCESSING PIPELINE
==================================================
Numeric features (7): ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
Categorical features (4): ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
Preprocessor defined with:
- Numeric: median imputation + standard scaling
- Categorical: most-frequent imputation + one-hot encoding
6. FITTING PREPROCESSOR (train data only)
==================================================
Preprocessor fitted on training data
Transformed train shape: (25932, 26)
Transformed test shape: (6484, 26)
Total features after preprocessing: 26
7. PERSISTING PREPROCESSOR
==================================================
Preprocessor saved to: models/preprocessor.joblib
8. SAVING PRE-PROCESSING DATA DICTIONARY
==================================================
Data dictionary saved to: artifacts/data_dictionary_pre_preprocessing.csv
feature dtype n_missing pct_missing n_unique \
0 person_age int64 0 0.00% 57
1 person_income int64 0 0.00% 4295
2 person_home_ownership object 0 0.00% 4
3 person_emp_length float64 887 2.74% 36
4 loan_intent object 0 0.00% 6
5 loan_grade object 0 0.00% 7
6 loan_amnt int64 0 0.00% 753
7 loan_int_rate float64 3095 9.55% 348
8 loan_percent_income float64 0 0.00% 77
9 cb_person_default_on_file object 0 0.00% 2
10 cb_person_cred_hist_length int64 0 0.00% 29
sample_values
0 [22, 21, 25]
1 [59000, 9600, 9600]
2 ['RENT', 'OWN', 'MORTGAGE']
3 [50.0, 5.0, 1.0]
4 ['PERSONAL', 'EDUCATION', 'MEDICAL']
5 ['D', 'B', 'C']
6 [35000, 1000, 5500]
7 [16.02, 11.14, 12.87]
8 [0.59, 0.1, 0.57]
9 ['Y', 'N', 'N']
10 [3, 2, 3]
9. PREPROCESSING SUMMARY
==================================================
Duplicates removed: 165
 Outliers capped: 7
Train/test split: 80%/20% (stratified)
Missing values handled via imputation
Categorical features one-hot encoded
Numeric features standardized
Preprocessor persisted and ready for modeling
10. VISUALIZING TRANSFORMED DATA
==================================================
Saved: artifacts/transformed_features_distribution.png
Saved: artifacts/transformed_data_pca.png
 Saved: artifacts/transformed_features_correlation.png

======= PREPROCESSING COMPLETE =======
5. Feature Engineering¶
- Create domain-driven engineered features to improve predictive power
- Compute debt-to-income ratio (dti_ratio) as core credit risk metric
- Generate interaction features: total loan cost, income-to-loan ratio, age-to-credit-history ratio
- Engineer employment stability score combining tenure and income
- Apply log transformations to reduce right-skew in income and loan amount
- Create categorical bins: DTI buckets, age groups, income quartiles for nonlinearity
- Build composite risk profile combining homeownership and default history
- Update preprocessing pipeline to handle engineered features
- Fit engineered preprocessor on training data only (avoid leakage)
- Transform both train and test sets with updated pipeline
- Track feature provenance (original vs. engineered) in metadata
- Persist engineered preprocessor and feature provenance to artifacts
- Visualize engineered feature distributions by target class
- Compute correlations between engineered features and loan default
- Document feature engineering rationale for transparency
- Feature count grows from 11 original features to 21 total (10 engineered features added)
# ========== 5. FEATURE ENGINEERING ==========
print("======= FEATURE ENGINEERING =======\n")
# We'll engineer features on the ORIGINAL data (before preprocessing)
# Then we'll create a new preprocessing pipeline that includes these features
print("1. CREATING ENGINEERED FEATURES")
print("="*50)
# Work with the cleaned dataframe (after duplicates/outliers removed)
X_eng = X_train.copy()
X_test_eng = X_test.copy()
# Track feature provenance
feature_provenance = {
'original': list(X_train.columns),
'engineered': []
}
# ========== DOMAIN-DRIVEN FEATURE ENGINEERING ==========
# 1. Debt-to-Income (DTI) ratio - critical for credit risk
X_eng['dti_ratio'] = X_eng['loan_amnt'] / (X_eng['person_income'] + 1) # +1 to avoid division by zero
X_test_eng['dti_ratio'] = X_test_eng['loan_amnt'] / (X_test_eng['person_income'] + 1)
feature_provenance['engineered'].append('dti_ratio')
print(" Created: dti_ratio = loan_amnt / person_income")
# 2. Total loan cost (interaction: amount × interest rate)
# Fill missing rates in BOTH sets with the TRAIN median to avoid test-set leakage
int_rate_median = X_eng['loan_int_rate'].median()
X_eng['total_loan_cost'] = X_eng['loan_amnt'] * (X_eng['loan_int_rate'].fillna(int_rate_median) / 100)
X_test_eng['total_loan_cost'] = X_test_eng['loan_amnt'] * (X_test_eng['loan_int_rate'].fillna(int_rate_median) / 100)
feature_provenance['engineered'].append('total_loan_cost')
print(" Created: total_loan_cost = loan_amnt × loan_int_rate")
# 3. Income-to-loan ratio (inverse of DTI, captures affordability from different angle)
X_eng['income_to_loan'] = (X_eng['person_income'] + 1) / (X_eng['loan_amnt'] + 1)
X_test_eng['income_to_loan'] = (X_test_eng['person_income'] + 1) / (X_test_eng['loan_amnt'] + 1)
feature_provenance['engineered'].append('income_to_loan')
print(" Created: income_to_loan = person_income / loan_amnt")
# 4. Age-to-credit-history ratio (financial maturity indicator)
X_eng['age_to_cred_hist'] = X_eng['person_age'] / (X_eng['cb_person_cred_hist_length'] + 1)
X_test_eng['age_to_cred_hist'] = X_test_eng['person_age'] / (X_test_eng['cb_person_cred_hist_length'] + 1)
feature_provenance['engineered'].append('age_to_cred_hist')
print(" Created: age_to_cred_hist = person_age / cb_person_cred_hist_length")
# 5. Employment stability score (emp_length × income)
X_eng['employment_stability'] = X_eng['person_emp_length'].fillna(0) * np.log1p(X_eng['person_income'])
X_test_eng['employment_stability'] = X_test_eng['person_emp_length'].fillna(0) * np.log1p(X_test_eng['person_income'])
feature_provenance['engineered'].append('employment_stability')
print(" Created: employment_stability = person_emp_length × log(person_income)")
# 6. Log transformations for skewed features
X_eng['log_income'] = np.log1p(X_eng['person_income'])
X_test_eng['log_income'] = np.log1p(X_test_eng['person_income'])
feature_provenance['engineered'].append('log_income')
print(" Created: log_income = log(1 + person_income)")
X_eng['log_loan_amnt'] = np.log1p(X_eng['loan_amnt'])
X_test_eng['log_loan_amnt'] = np.log1p(X_test_eng['loan_amnt'])
feature_provenance['engineered'].append('log_loan_amnt')
print(" Created: log_loan_amnt = log(1 + loan_amnt)")
# 7. DTI buckets (categorical binning for nonlinearity)
dti_bins = [0, 0.2, 0.4, 0.6, np.inf]
dti_labels = ['low_dti', 'medium_dti', 'high_dti', 'very_high_dti']
X_eng['dti_bucket'] = pd.cut(X_eng['dti_ratio'], bins=dti_bins, labels=dti_labels)
X_test_eng['dti_bucket'] = pd.cut(X_test_eng['dti_ratio'], bins=dti_bins, labels=dti_labels)
feature_provenance['engineered'].append('dti_bucket')
print(" Created: dti_bucket = binned(dti_ratio) [low, medium, high, very_high]")
# 8. Age groups (life stage proxy)
age_bins = [0, 25, 35, 50, np.inf]
age_labels = ['young', 'mid_career', 'established', 'senior']
X_eng['age_group'] = pd.cut(X_eng['person_age'], bins=age_bins, labels=age_labels)
X_test_eng['age_group'] = pd.cut(X_test_eng['person_age'], bins=age_bins, labels=age_labels)
feature_provenance['engineered'].append('age_group')
print(" Created: age_group = binned(person_age) [young, mid_career, established, senior]")
# 9. Income quantiles (relative income position)
income_quantiles = X_eng['person_income'].quantile([0.25, 0.5, 0.75]).values
X_eng['income_quartile'] = pd.cut(X_eng['person_income'],
bins=[-np.inf] + list(income_quantiles) + [np.inf],
labels=['Q1', 'Q2', 'Q3', 'Q4'])
X_test_eng['income_quartile'] = pd.cut(X_test_eng['person_income'],
bins=[-np.inf] + list(income_quantiles) + [np.inf],
labels=['Q1', 'Q2', 'Q3', 'Q4'])
feature_provenance['engineered'].append('income_quartile')
print(" Created: income_quartile = binned(person_income) [Q1, Q2, Q3, Q4]")
# 10. Interaction: homeownership + default history (high-risk profile)
X_eng['risk_profile'] = (X_eng['person_home_ownership'] == 'RENT').astype(int) + \
(X_eng['cb_person_default_on_file'] == 'Y').astype(int)
X_test_eng['risk_profile'] = (X_test_eng['person_home_ownership'] == 'RENT').astype(int) + \
(X_test_eng['cb_person_default_on_file'] == 'Y').astype(int)
feature_provenance['engineered'].append('risk_profile')
print(" Created: risk_profile = (is_renter) + (has_default_history)")
print(f"\n Total engineered features created: {len(feature_provenance['engineered'])}")
print(f" Original features: {len(feature_provenance['original'])}")
print(f" Engineered features: {len(feature_provenance['engineered'])}")
print(f" Total features: {len(feature_provenance['original']) + len(feature_provenance['engineered'])}")
# ========== 2. UPDATE PREPROCESSING PIPELINE ==========
print("\n2. UPDATING PREPROCESSING PIPELINE")
print("="*50)
# Identify new numeric and categorical columns
numeric_features_eng = X_eng.select_dtypes(include=[np.number]).columns.tolist()
categorical_features_eng = X_eng.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Numeric features ({len(numeric_features_eng)}): {numeric_features_eng}")
print(f"Categorical features ({len(categorical_features_eng)}): {categorical_features_eng}")
# Create updated preprocessor
preprocessor_eng = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features_eng),
('cat', categorical_transformer, categorical_features_eng)
],
remainder='drop'
)
# Fit on engineered training data
preprocessor_eng.fit(X_eng)
print(" Engineered preprocessor fitted")
# Transform both sets
X_train_eng_transformed = preprocessor_eng.transform(X_eng)
X_test_eng_transformed = preprocessor_eng.transform(X_test_eng)
print(f"Transformed train shape: {X_train_eng_transformed.shape}")
print(f"Transformed test shape: {X_test_eng_transformed.shape}")
# ========== 3. PERSIST ARTIFACTS ==========
print("\n3. PERSISTING ARTIFACTS")
print("="*50)
# Save engineered preprocessor
preprocessor_eng_path = 'models/preprocessor_engineered.joblib'
dump(preprocessor_eng, preprocessor_eng_path)
print(f" Engineered preprocessor saved to: {preprocessor_eng_path}")
# Save feature provenance
provenance_path = 'artifacts/feature_provenance.json'
with open(provenance_path, 'w') as f:
json.dump(feature_provenance, f, indent=2)
print(f" Feature provenance saved to: {provenance_path}")
# Update transformed data variables for downstream use
X_train_transformed = X_train_eng_transformed
X_test_transformed = X_test_eng_transformed
preprocessor = preprocessor_eng
print(f" Updated X_train_transformed: {X_train_transformed.shape}")
print(f" Updated X_test_transformed: {X_test_transformed.shape}")
# ========== 4. FEATURE ENGINEERING SUMMARY ==========
print("\n4. FEATURE ENGINEERING RATIONALE")
print("="*50)
rationale = {
'dti_ratio': 'Core credit metric: higher DTI = higher default risk',
'total_loan_cost': 'Captures true loan burden (principal + interest)',
'income_to_loan': 'Affordability from inverse perspective',
'age_to_cred_hist': 'Financial maturity: younger age vs credit history may indicate instability',
'employment_stability': 'Combines job tenure and income level for stability proxy',
'log_income': 'Reduces right-skew in income distribution',
'log_loan_amnt': 'Reduces right-skew in loan amount distribution',
'dti_bucket': 'Captures nonlinear DTI thresholds (e.g., >40% DTI often flagged)',
'age_group': 'Life stage proxy: risk profiles differ by age bracket',
'income_quartile': 'Relative income position in population',
'risk_profile': 'Composite risk: renters with default history = highest risk'
}
print("\nFeature Engineering Decisions:")
for feat, reason in rationale.items():
print(f" • {feat}: {reason}")
# ========== 5. VISUALIZE ENGINEERED FEATURES ==========
print("\n5. VISUALIZING ENGINEERED FEATURES")
print("="*50)
# Create dataframe with engineered features + target for visualization
vis_df = X_eng.copy()
vis_df['loan_status'] = y_train.values
fig, axes = plt.subplots(3, 3, figsize=(18, 14))
fig.suptitle('Engineered Features Distribution by Loan Status', fontsize=16, fontweight='bold')
# Plot engineered numeric features
engineered_numeric = ['dti_ratio', 'total_loan_cost', 'income_to_loan',
'age_to_cred_hist', 'employment_stability',
'log_income', 'log_loan_amnt', 'risk_profile']
for idx, feat in enumerate(engineered_numeric):
if idx < 9:
ax = axes[idx // 3, idx % 3]
vis_df.boxplot(column=feat, by='loan_status', ax=ax)
ax.set_title(f'{feat} by Loan Status')
ax.set_xlabel('Loan Status (0=Good, 1=Default)')
ax.set_ylabel(feat)
plt.sca(ax)
plt.xticks([1, 2], ['0', '1'])
# Remove empty subplot
if len(engineered_numeric) < 9:
fig.delaxes(axes[2, 2])
plt.tight_layout()
plt.savefig('artifacts/engineered_features_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/engineered_features_distribution.png")
# Correlation of engineered features with target
engineered_corr = vis_df[engineered_numeric + ['loan_status']].corr()['loan_status'].sort_values(ascending=False)
print("\nCorrelation of engineered features with target:")
print(engineered_corr)
plt.figure(figsize=(10, 6))
engineered_corr.drop('loan_status').plot(kind='barh', color='steelblue', edgecolor='black')
plt.title('Engineered Features Correlation with Loan Default', fontsize=14, fontweight='bold')
plt.xlabel('Pearson Correlation', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/engineered_features_correlation.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/engineered_features_correlation.png")
print("\n======= FEATURE ENGINEERING COMPLETE =======")
print(f" Ready for modeling with {X_train_transformed.shape[1]} total features")
======= FEATURE ENGINEERING =======

1. CREATING ENGINEERED FEATURES
==================================================
 Created: dti_ratio = loan_amnt / person_income
 Created: total_loan_cost = loan_amnt × loan_int_rate
 Created: income_to_loan = person_income / loan_amnt
 Created: age_to_cred_hist = person_age / cb_person_cred_hist_length
 Created: employment_stability = person_emp_length × log(person_income)
 Created: log_income = log(1 + person_income)
 Created: log_loan_amnt = log(1 + loan_amnt)
 Created: dti_bucket = binned(dti_ratio) [low, medium, high, very_high]
 Created: age_group = binned(person_age) [young, mid_career, established, senior]
 Created: income_quartile = binned(person_income) [Q1, Q2, Q3, Q4]
 Created: risk_profile = (is_renter) + (has_default_history)

 Total engineered features created: 11
 Original features: 11
 Engineered features: 11
 Total features: 22

2. UPDATING PREPROCESSING PIPELINE
==================================================
Numeric features (15): ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length', 'dti_ratio', 'total_loan_cost', 'income_to_loan', 'age_to_cred_hist', 'employment_stability', 'log_income', 'log_loan_amnt', 'risk_profile']
Categorical features (7): ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file', 'dti_bucket', 'age_group', 'income_quartile']
 Engineered preprocessor fitted
Transformed train shape: (25932, 46)
Transformed test shape: (6484, 46)

3. PERSISTING ARTIFACTS
==================================================
 Engineered preprocessor saved to: models/preprocessor_engineered.joblib
 Feature provenance saved to: artifacts/feature_provenance.json
 Updated X_train_transformed: (25932, 46)
 Updated X_test_transformed: (6484, 46)

4. FEATURE ENGINEERING RATIONALE
==================================================

Feature Engineering Decisions:
 • dti_ratio: Core credit metric: higher DTI = higher default risk
 • total_loan_cost: Captures true loan burden (principal + interest)
 • income_to_loan: Affordability from inverse perspective
 • age_to_cred_hist: Financial maturity: younger age vs credit history may indicate instability
 • employment_stability: Combines job tenure and income level for stability proxy
 • log_income: Reduces right-skew in income distribution
 • log_loan_amnt: Reduces right-skew in loan amount distribution
 • dti_bucket: Captures nonlinear DTI thresholds (e.g., >40% DTI often flagged)
 • age_group: Life stage proxy: risk profiles differ by age bracket
 • income_quartile: Relative income position in population
 • risk_profile: Composite risk: renters with default history = highest risk

5. VISUALIZING ENGINEERED FEATURES
==================================================
 Saved: artifacts/engineered_features_distribution.png

Correlation of engineered features with target:
loan_status             1.000000
dti_ratio               0.386180
risk_profile            0.290733
total_loan_cost         0.209052
log_loan_amnt           0.078961
age_to_cred_hist        0.018840
employment_stability   -0.101168
income_to_loan         -0.136450
log_income             -0.281369
Name: loan_status, dtype: float64

 Saved: artifacts/engineered_features_correlation.png

======= FEATURE ENGINEERING COMPLETE =======
 Ready for modeling with 46 total features
6. Final Dataset Characterization¶
- Report final training and test set dimensions after preprocessing and feature engineering
- Display class distribution for both training and test sets with counts and percentages
- Calculate class imbalance ratio (majority:minority) to flag potential modeling challenges
- Break down feature composition: original vs. engineered vs. one-hot encoded features
- Extract and categorize transformed feature names from the preprocessing pipeline
- Generate comprehensive post-preprocessing data dictionary with:
- Feature index, name, type (numeric_scaled / categorical_onehot)
- Provenance (original / engineered / derived)
- Basic statistics (mean, std, min, max, unique values)
- Save post-preprocessing data dictionary to CSV for reproducibility
- Confirm data quality: no missing values, standardized numerics, encoded categoricals
- Create final characterization dashboard with 4 visualizations:
- Class distribution comparison (train vs. test)
- Feature provenance pie chart
- Feature type distribution
- Sample feature statistics (scaled numeric features)
- Persist summary visualization to artifacts directory
- Prepare modeling-ready datasets with confirmed shapes, balanced splits, and full feature pipeline
# ========== 6. FINAL DATASET CHARACTERIZATION ==========
print("======= FINAL DATASET CHARACTERIZATION =======\n")
print("1. DATASET DIMENSIONS")
print("="*50)
print(f"Training set shape: {X_train_transformed.shape}")
print(f"Test set shape: {X_test_transformed.shape}")
print(f" - Training samples: {X_train_transformed.shape[0]:,}")
print(f" - Test samples: {X_test_transformed.shape[0]:,}")
print(f" - Total features (after preprocessing): {X_train_transformed.shape[1]}")
print("\n2. CLASS BALANCE")
print("="*50)
print("Training set:")
train_class_counts = y_train.value_counts().sort_index()
train_class_pcts = y_train.value_counts(normalize=True).sort_index()
for cls in train_class_counts.index:
print(f" Class {cls}: {train_class_counts[cls]:,} ({train_class_pcts[cls]:.2%})")
print("\nTest set:")
test_class_counts = y_test.value_counts().sort_index()
test_class_pcts = y_test.value_counts(normalize=True).sort_index()
for cls in test_class_counts.index:
print(f" Class {cls}: {test_class_counts[cls]:,} ({test_class_pcts[cls]:.2%})")
# Calculate imbalance ratio
imbalance_ratio = train_class_counts[0] / train_class_counts[1]
print(f"\nImbalance ratio (class 0 : class 1): {imbalance_ratio:.2f}:1")
print(f" Dataset is imbalanced - consider class_weight='balanced' for appropriate models")
print("\n3. FEATURE COMPOSITION")
print("="*50)
print(f"Original features: {len(feature_provenance['original'])}")
print(f"Engineered features: {len(feature_provenance['engineered'])}")
print(f"Total input features (before encoding): {len(feature_provenance['original']) + len(feature_provenance['engineered'])}")
print(f"Total features (after one-hot encoding): {X_train_transformed.shape[1]}")
# Get feature names from preprocessor
try:
all_feature_names = preprocessor.get_feature_names_out()
# Categorize by type
numeric_encoded = [f for f in all_feature_names if f.startswith('num__')]
categorical_encoded = [f for f in all_feature_names if f.startswith('cat__')]
print(f"\nFeature type breakdown after preprocessing:")
print(f" - Numeric features (scaled): {len(numeric_encoded)}")
print(f" - Categorical features (one-hot encoded): {len(categorical_encoded)}")
except Exception as e:
print(f"Note: Could not extract detailed feature names: {e}")
all_feature_names = [f"feature_{i}" for i in range(X_train_transformed.shape[1])]
numeric_encoded = []
categorical_encoded = []
print("\n4. POST-PREPROCESSING DATA DICTIONARY")
print("="*50)
# Create comprehensive post-prep data dictionary
post_prep_dict = []
# Add info about transformed features
for idx, feat_name in enumerate(all_feature_names):
# Determine if original or engineered
base_name = feat_name.split('__')[1] if '__' in feat_name else feat_name
# Check provenance
if base_name in feature_provenance['original']:
provenance = 'original'
elif base_name in feature_provenance['engineered']:
provenance = 'engineered'
elif any(base_name.startswith(orig) for orig in feature_provenance['original']):
provenance = 'original_encoded'
elif any(base_name.startswith(eng) for eng in feature_provenance['engineered']):
provenance = 'engineered_encoded'
else:
provenance = 'derived'
# Determine type
if feat_name.startswith('num__'):
feat_type = 'numeric_scaled'
elif feat_name.startswith('cat__'):
feat_type = 'categorical_onehot'
else:
feat_type = 'unknown'
# Get basic stats from training data
feat_values = X_train_transformed[:, idx]
post_prep_dict.append({
'index': idx,
'feature_name': feat_name,
'type': feat_type,
'provenance': provenance,
'mean': np.mean(feat_values),
'std': np.std(feat_values),
'min': np.min(feat_values),
'max': np.max(feat_values),
'n_unique': len(np.unique(feat_values))
})
post_prep_df = pd.DataFrame(post_prep_dict)
# Display summary
print(f"\nTotal features in transformed dataset: {len(post_prep_df)}")
print(f"\nProvenance breakdown:")
print(post_prep_df['provenance'].value_counts())
print(f"\nType breakdown:")
print(post_prep_df['type'].value_counts())
# Save to CSV
post_prep_dict_path = 'artifacts/data_dictionary_post_preprocessing.csv'
post_prep_df.to_csv(post_prep_dict_path, index=False)
print(f"\n Post-preprocessing data dictionary saved to: {post_prep_dict_path}")
# Display first few rows
print(f"\nSample of post-preprocessing features:")
print(post_prep_df.head(20).to_string())
print("\n5. DATA QUALITY SUMMARY")
print("="*50)
print(" No missing values (imputed during preprocessing)")
print(" All numeric features standardized (mean≈0, std≈1)")
print(" Categorical features one-hot encoded")
print(" Outliers capped using domain rules")
print(" Duplicates removed")
print(f" Stratified train-test split maintains class balance")
print("\n6. READY FOR MODELING")
print("="*50)
print(f" Training set: {X_train_transformed.shape[0]:,} samples × {X_train_transformed.shape[1]} features")
print(f" Test set: {X_test_transformed.shape[0]:,} samples × {X_test_transformed.shape[1]} features")
print(f" Target: Binary classification (0=No Default, 1=Default)")
print(f" Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f" All preprocessing artifacts saved to 'artifacts/' directory")
print(f" Trained preprocessor saved to 'models/' directory")
# Create summary visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Final Dataset Characterization Summary', fontsize=16, fontweight='bold')
# Plot 1: Class distribution comparison
ax1 = axes[0, 0]
x_pos = np.arange(2)
width = 0.35
ax1.bar(x_pos - width/2, train_class_counts.values, width, label='Train', alpha=0.8)
ax1.bar(x_pos + width/2, test_class_counts.values, width, label='Test', alpha=0.8)
ax1.set_xlabel('Class', fontweight='bold')
ax1.set_ylabel('Count', fontweight='bold')
ax1.set_title('Class Distribution: Train vs Test')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(['No Default (0)', 'Default (1)'])
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
# Plot 2: Feature provenance
ax2 = axes[0, 1]
provenance_counts = post_prep_df['provenance'].value_counts()
colors = plt.cm.Set3(range(len(provenance_counts)))
ax2.pie(provenance_counts.values, labels=provenance_counts.index, autopct='%1.1f%%',
colors=colors, startangle=90)
ax2.set_title('Feature Provenance')
# Plot 3: Feature type distribution
ax3 = axes[1, 0]
type_counts = post_prep_df['type'].value_counts()
ax3.bar(range(len(type_counts)), type_counts.values, color='steelblue', alpha=0.8)
ax3.set_xlabel('Feature Type', fontweight='bold')
ax3.set_ylabel('Count', fontweight='bold')
ax3.set_title('Feature Type Distribution')
ax3.set_xticks(range(len(type_counts)))  # use type_counts length, not x_pos from Plot 1
ax3.set_xticklabels(type_counts.index, rotation=15, ha='right')
ax3.grid(axis='y', alpha=0.3)
# Plot 4: Sample feature statistics (first 15 numeric features)
ax4 = axes[1, 1]
numeric_features_sample = post_prep_df[post_prep_df['type'] == 'numeric_scaled'].head(15)
if len(numeric_features_sample) > 0:
x_pos = np.arange(len(numeric_features_sample))
ax4.barh(x_pos, numeric_features_sample['std'].values, color='coral', alpha=0.8)
ax4.set_yticks(x_pos)
ax4.set_yticklabels([name.split('__')[1] if '__' in name else name
for name in numeric_features_sample['feature_name']], fontsize=8)
ax4.set_xlabel('Standard Deviation', fontweight='bold')
ax4.set_title('Std Dev of Scaled Numeric Features (Sample)')
ax4.axvline(x=1, color='red', linestyle='--', linewidth=1, alpha=0.7, label='Expected≈1')
ax4.legend(fontsize=8)
ax4.grid(axis='x', alpha=0.3)
else:
ax4.text(0.5, 0.5, 'No numeric features', ha='center', va='center')
ax4.axis('off')
plt.tight_layout()
plt.savefig('artifacts/final_dataset_characterization.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/final_dataset_characterization.png")
print("\n======= DATASET CHARACTERIZATION COMPLETE =======")
print(" Ready to proceed to model training (Section 7)")
======= FINAL DATASET CHARACTERIZATION =======
1. DATASET DIMENSIONS
==================================================
Training set shape: (25932, 46)
Test set shape: (6484, 46)
- Training samples: 25,932
- Test samples: 6,484
- Total features (after preprocessing): 46
2. CLASS BALANCE
==================================================
Training set:
Class 0: 20,261 (78.13%)
Class 1: 5,671 (21.87%)
Test set:
Class 0: 5,066 (78.13%)
Class 1: 1,418 (21.87%)
Imbalance ratio (class 0 : class 1): 3.57:1
Dataset is imbalanced - consider class_weight='balanced' for appropriate models
3. FEATURE COMPOSITION
==================================================
Original features: 11
Engineered features: 11
Total input features (before encoding): 22
Total features (after one-hot encoding): 46
Feature type breakdown after preprocessing:
- Numeric features (scaled): 15
- Categorical features (one-hot encoded): 31
4. POST-PREPROCESSING DATA DICTIONARY
==================================================
Total features in transformed dataset: 46
Provenance breakdown:
provenance
original_encoded 19
engineered_encoded 12
engineered 8
original 7
Name: count, dtype: int64
Type breakdown:
type
categorical_onehot 31
numeric_scaled 15
Name: count, dtype: int64
Post-preprocessing data dictionary saved to: artifacts/data_dictionary_post_preprocessing.csv
Sample of post-preprocessing features:
index feature_name type provenance mean std min max n_unique
0 0 num__person_age numeric_scaled original -2.100228e-16 1.000000 -1.233333 11.548405 54
1 1 num__person_income numeric_scaled original -9.617480e-17 1.000000 -0.979977 94.056159 3710
2 2 num__person_emp_length numeric_scaled original 5.671847e-17 1.000000 -1.191765 11.373735 33
3 3 num__loan_amnt numeric_scaled original 1.368641e-16 1.000000 -1.446230 4.041453 713
4 4 num__loan_int_rate numeric_scaled original 5.121103e-16 1.000000 -1.820897 3.968051 336
5 5 num__loan_percent_income numeric_scaled original 3.057866e-16 1.000000 -1.599242 6.174224 75
6 6 num__cb_person_cred_hist_length numeric_scaled original -6.302053e-17 1.000000 -0.938859 5.982390 29
7 7 num__dti_ratio numeric_scaled engineered -1.248080e-16 1.000000 -1.590350 6.153714 12762
8 8 num__total_loan_cost numeric_scaled engineered 0.000000e+00 1.000000 -1.207433 7.009202 8615
9 9 num__income_to_loan numeric_scaled engineered 6.096551e-18 1.000000 -0.663797 93.378779 12801
10 10 num__age_to_cred_hist numeric_scaled engineered 5.480046e-19 1.000000 -1.833511 15.797027 249
11 11 num__employment_stability numeric_scaled engineered 1.150810e-17 1.000000 -1.123593 12.250956 6933
12 12 num__log_income numeric_scaled engineered -1.657166e-15 1.000000 -4.664837 8.312460 3710
13 13 num__log_loan_amnt numeric_scaled engineered -4.038109e-16 1.000000 -3.846981 2.143262 713
14 14 num__risk_profile numeric_scaled engineered 9.740781e-17 1.000000 -1.052560 2.032491 3
15 15 cat__person_home_ownership_MORTGAGE categorical_onehot original_encoded 4.121549e-01 0.492223 0.000000 1.000000 2
16 16 cat__person_home_ownership_OTHER categorical_onehot original_encoded 3.393491e-03 0.058155 0.000000 1.000000 2
17 17 cat__person_home_ownership_OWN categorical_onehot original_encoded 7.874441e-02 0.269339 0.000000 1.000000 2
18 18 cat__person_home_ownership_RENT categorical_onehot original_encoded 5.057072e-01 0.499967 0.000000 1.000000 2
19 19 cat__loan_intent_DEBTCONSOLIDATION categorical_onehot original_encoded 1.611522e-01 0.367671 0.000000 1.000000 2
5. DATA QUALITY SUMMARY
==================================================
No missing values (imputed during preprocessing)
All numeric features standardized (mean≈0, std≈1)
Categorical features one-hot encoded
Outliers capped using domain rules
Duplicates removed
Stratified train-test split maintains class balance
6. READY FOR MODELING
==================================================
Training set: 25,932 samples × 46 features
Test set: 6,484 samples × 46 features
Target: Binary classification (0=No Default, 1=Default)
Imbalance ratio: 3.57:1
All preprocessing artifacts saved to 'artifacts/' directory
Trained preprocessor saved to 'models/' directory
 Saved: artifacts/final_dataset_characterization.png

======= DATASET CHARACTERIZATION COMPLETE =======
 Ready to proceed to model training (Section 7)
7. Candidate Models Overview¶
This project evaluates a broad set of supervised classification models to determine which algorithm best predicts credit risk, defined by the binary target loan_status (0 = good loan, 1 = default). The models selected reflect a diverse range of learning paradigms—including linear, probabilistic, instance-based, tree-based, ensemble, margin-based, and neural architectures—providing a comprehensive comparison across modeling families.
The following models were implemented and tuned using standardized preprocessing, consistent evaluation metrics, and stratified cross-validation:
Linear and Probabilistic Models¶
- Logistic Regression: a linear baseline classifier with L2 regularization and class weighting.
- Linear Discriminant Analysis (LDA): assumes class-conditional Gaussian distributions with a shared covariance matrix.
- Quadratic Discriminant Analysis (QDA): like LDA but allows class-specific covariance matrices.
- Gaussian Naive Bayes: probabilistic classifier assuming conditional independence of features.
Instance-Based and Distance-Based Model¶
- k-Nearest Neighbors (KNN): non-parametric classifier relying on distance metrics in the transformed feature space.
Tree-Based Models¶
- Decision Tree (CART): interpretable baseline tree with cost-complexity pruning.
- Bagging (Bootstrap Aggregation): ensemble of decision trees trained on bootstrap-resampled datasets.
- Random Forest: bagged decision trees with feature subsampling to reduce correlation between trees.
- AdaBoost: sequential boosting of weak learners (decision stumps).
- Gradient Boosting: additive ensemble of shallow trees fit sequentially to residual errors.
Margin-Based Models¶
- Linear SVM: maximum-margin linear classifier calibrated for probability outputs.
- RBF SVM: nonlinear kernel-based classifier with tuned hyperparameters.
Neural Network Model¶
- MLP (Feedforward Neural Network): fully connected network trained with backpropagation to learn nonlinear patterns.
All models were trained using identical training and test splits, the same preprocessing pipeline, and consistent scoring metrics. Hyperparameters were selected using cross-validation, and final models were compared on out-of-sample performance using AUC, accuracy, sensitivity, and specificity. This systematic framework ensures an unbiased and informative comparison of model performance for credit risk prediction.
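The "identical splits, same pipeline, consistent metrics" protocol described above can be sketched as a registry of candidate estimators driven by one evaluation loop, so every model sees exactly the same data and scoring. This is a minimal illustration on synthetic data, not the project's exact code; the `candidates` dict and its three entries are stand-ins for the full model list.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the transformed credit data (~78/22 class mix)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.78, 0.22], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# One registry, one loop: every candidate gets the identical split and metric
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'DecisionTree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of class 1 (default)
    results[name] = roc_auc_score(y_te, proba)

# Rank by test AUC, mirroring the Section 10 leaderboard
for name, auc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: test AUC = {auc:.3f}")
```

Keeping the estimators in a single dict makes it hard to accidentally give one model a different split or metric, which is the point of the comparison framework.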
7.0 Baseline — DummyClassifier¶
Purpose & Approach:
- Establishes a naïve baseline using a DummyClassifier with stratified random predictions drawn from the training class distribution
- Serves as the minimum performance threshold that any real model must exceed to be considered useful
- Provides a reference point for measuring the improvement achieved by actual machine learning algorithms
Configuration & Tuning:
- No hyperparameters to tune; strategy fixed to 'stratified' for class-proportional predictions
- Expected AUC ≈ 0.50, since stratified random predictions carry no information about the target (this holds regardless of class balance)
Evaluation & Comparison:
- Metrics include accuracy, ROC-AUC, sensitivity, specificity, and confusion matrix
- ROC curve plotted to visualize performance against diagonal (random chance line)
- All subsequent models are ranked against this baseline to quantify predictive value gained from learning algorithms
# ========== 7.0 BASELINE — DUMMYCLASSIFIER ==========
print("======= 7.0 BASELINE — DUMMYCLASSIFIER =======\n")
# Define the baseline model
# Strategy options: 'most_frequent', 'stratified', 'uniform', 'constant'
# 'stratified' matches class distribution and is a reasonable baseline
dummy_baseline = DummyClassifier(strategy='stratified', random_state=RANDOM_STATE)
print("Training DummyClassifier (Baseline)...")
print("Strategy: 'stratified' (predicts based on training class distribution)")
print("\n No hyperparameters to tune for baseline model")
# Fit on training data
dummy_baseline.fit(X_train_transformed, y_train)
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_dummy = dummy_baseline.predict(X_train_transformed)
y_test_pred_dummy = dummy_baseline.predict(X_test_transformed)
y_train_proba_dummy = dummy_baseline.predict_proba(X_train_transformed)[:, 1]
y_test_proba_dummy = dummy_baseline.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_dummy = accuracy_score(y_train, y_train_pred_dummy)
test_acc_dummy = accuracy_score(y_test, y_test_pred_dummy)
# AUC
train_auc_dummy = roc_auc_score(y_train, y_train_proba_dummy)
test_auc_dummy = roc_auc_score(y_test, y_test_proba_dummy)
# Confusion matrices for Sensitivity/Specificity
cm_train_dummy = confusion_matrix(y_train, y_train_pred_dummy)
cm_test_dummy = confusion_matrix(y_test, y_test_pred_dummy)
# Project convention: Sensitivity = recall of class 0 = cm[0,0] / (cm[0,0] + cm[0,1])
#                     Specificity = recall of class 1 = cm[1,1] / (cm[1,0] + cm[1,1])
# (Note this treats class 0 as positive, the reverse of the usual convention.)
train_sensitivity_dummy = cm_train_dummy[0, 0] / (cm_train_dummy[0, 0] + cm_train_dummy[0, 1]) if (cm_train_dummy[0, 0] + cm_train_dummy[0, 1]) > 0 else 0
train_specificity_dummy = cm_train_dummy[1, 1] / (cm_train_dummy[1, 0] + cm_train_dummy[1, 1]) if (cm_train_dummy[1, 0] + cm_train_dummy[1, 1]) > 0 else 0
test_sensitivity_dummy = cm_test_dummy[0, 0] / (cm_test_dummy[0, 0] + cm_test_dummy[0, 1]) if (cm_test_dummy[0, 0] + cm_test_dummy[0, 1]) > 0 else 0
test_specificity_dummy = cm_test_dummy[1, 1] / (cm_test_dummy[1, 0] + cm_test_dummy[1, 1]) if (cm_test_dummy[1, 0] + cm_test_dummy[1, 1]) > 0 else 0
# Pack metrics
metrics_dummy = {
'Model': 'DummyClassifier (Baseline)',
'Train Accuracy': train_acc_dummy,
'Test Accuracy': test_acc_dummy,
'Train AUC': train_auc_dummy,
'Test AUC': test_auc_dummy,
'Train Sensitivity': train_sensitivity_dummy,
'Test Sensitivity': test_sensitivity_dummy,
'Train Specificity': train_specificity_dummy,
'Test Specificity': test_specificity_dummy
}
# Display metrics
metrics_df_dummy = pd.DataFrame([metrics_dummy])
print("\nMetrics Summary:")
print(metrics_df_dummy.to_string(index=False))
print("\n BASELINE INTERPRETATION:")
print(f" This model predicts randomly based on class distribution.")
print(f" Any real model MUST outperform these metrics to be useful.")
print(f" Expected AUC ≈ 0.50 for a truly random baseline.")
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_dummy, tpr_train_dummy, _ = roc_curve(y_train, y_train_proba_dummy)
fpr_test_dummy, tpr_test_dummy, _ = roc_curve(y_test, y_test_proba_dummy)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_dummy, tpr_train_dummy, label=f'Train (AUC = {train_auc_dummy:.3f})', linewidth=2, alpha=0.7)
plt.plot(fpr_test_dummy, tpr_test_dummy, label=f'Test (AUC = {test_auc_dummy:.3f})', linewidth=2, alpha=0.7)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — DummyClassifier (Baseline)', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_dummy_baseline.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_dummy_baseline.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save baseline model
model_path_dummy = 'models/dummy_baseline.joblib'
dump(dummy_baseline, model_path_dummy)
print(f" Baseline model saved to: {model_path_dummy}")
# Save metrics
metrics_path_dummy = 'artifacts/dummy_baseline_metrics.json'
with open(metrics_path_dummy, 'w') as f:
json.dump(metrics_dummy, f, indent=2)
print(f" Metrics saved to: {metrics_path_dummy}")
print("\n======= DUMMYCLASSIFIER (BASELINE) COMPLETE =======\n")
# Return for potential downstream use
tuned_model_dummy = dummy_baseline
best_params_dummy = {} # No hyperparameters
metrics_dict_dummy = metrics_dummy
======= 7.0 BASELINE — DUMMYCLASSIFIER =======
Training DummyClassifier (Baseline)...
Strategy: 'stratified' (predicts based on training class distribution)
No hyperparameters to tune for baseline model
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
DummyClassifier (Baseline) 0.655676 0.654226 0.496105 0.492562 0.779725 0.779905 0.212485 0.205219
BASELINE INTERPRETATION:
This model predicts randomly based on class distribution.
Any real model MUST outperform these metrics to be useful.
Expected AUC ≈ 0.50 for a truly random baseline.
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_dummy_baseline.png

==================================================
SAVING ARTIFACTS
==================================================
 Baseline model saved to: models/dummy_baseline.joblib
 Metrics saved to: artifacts/dummy_baseline_metrics.json

======= DUMMYCLASSIFIER (BASELINE) COMPLETE =======
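The per-class rates computed in the baseline cell repeat the same confusion-matrix arithmetic four times. A small helper, assuming a 2×2 matrix from sklearn's `confusion_matrix` (rows = true class, columns = predicted class), keeps the project's convention (sensitivity = recall of class 0, specificity = recall of class 1) in one place; `class_recalls` is an illustrative name, not a function from the notebook.

```python
from sklearn.metrics import confusion_matrix

def class_recalls(y_true, y_pred):
    """Return (sensitivity, specificity) under the project convention:
    sensitivity = recall of class 0 (good loans correctly kept),
    specificity = recall of class 1 (defaults correctly flagged)."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    row_totals = cm.sum(axis=1)  # true counts per class
    sens = cm[0, 0] / row_totals[0] if row_totals[0] else 0.0
    spec = cm[1, 1] / row_totals[1] if row_totals[1] else 0.0
    return sens, spec

# Tiny worked example: 3 of 4 true zeros kept, 1 of 2 true ones flagged
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
sens, spec = class_recalls(y_true, y_pred)
print(sens, spec)  # → 0.75 0.5
```

Passing `labels=[0, 1]` pins the matrix orientation even if one class is absent from a fold, which the inline `cm[i, j]` indexing silently assumes.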
7.1 Logistic Regression¶
Purpose & Approach:
- Establishes a linear baseline for binary classification using regularized logistic regression with L2 penalty
- Balances class weights to address the dataset's imbalance (78% no-default, 22% default)
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored regularization strength (C), penalty type (l2, None), solvers (liblinear, saga), and class weighting
- Selected configuration: C=100, class_weight='balanced', solver='liblinear'
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a strong parametric benchmark against tree-based, probabilistic, and kernel methods
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis
# ========== 7.1 LOGISTIC REGRESSION ==========
print("======= 7.1 LOGISTIC REGRESSION =======\n")
# Define the model
log_reg = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
# Define hyperparameter grid
# Notes: elasticnet would require the saga solver plus an l1_ratio, so it
# is excluded. liblinear does not support penalty=None, so those cells of
# the grid below fail to fit and GridSearchCV scores them as NaN (this is
# the FitFailedWarning in the output). A list-of-dicts grid would avoid
# generating the invalid combinations altogether.
param_grid_lr = {
'penalty': ['l2', None],
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'solver': ['liblinear', 'saga'],
'class_weight': [None, 'balanced']
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# GridSearchCV
grid_search_lr = GridSearchCV(
estimator=log_reg,
param_grid=param_grid_lr,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
print("Training Logistic Regression with GridSearchCV...")
grid_search_lr.fit(X_train_transformed, y_train)
# Best model
best_lr = grid_search_lr.best_estimator_
best_params_lr = grid_search_lr.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_lr.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred = best_lr.predict(X_train_transformed)
y_test_pred = best_lr.predict(X_test_transformed)
y_train_proba = best_lr.predict_proba(X_train_transformed)[:, 1]
y_test_proba = best_lr.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
# AUC (using class 1 as positive)
train_auc = roc_auc_score(y_train, y_train_proba)
test_auc = roc_auc_score(y_test, y_test_proba)
# Confusion matrices for sensitivity/specificity
cm_train = confusion_matrix(y_train, y_train_pred)
cm_test = confusion_matrix(y_test, y_test_pred)
# With labels (0, 1), cm[i, j] counts true class i predicted as class j,
# so cm = [[n_00, n_01], [n_10, n_11]]. Treating class 0 (good loans) as
# the event of interest:
#   Sensitivity = recall on class 0 = cm[0,0] / (cm[0,0] + cm[0,1])
#   Specificity = recall on class 1 = cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity = cm_train[0, 0] / (cm_train[0, 0] + cm_train[0, 1]) if (cm_train[0, 0] + cm_train[0, 1]) > 0 else 0
train_specificity = cm_train[1, 1] / (cm_train[1, 0] + cm_train[1, 1]) if (cm_train[1, 0] + cm_train[1, 1]) > 0 else 0
test_sensitivity = cm_test[0, 0] / (cm_test[0, 0] + cm_test[0, 1]) if (cm_test[0, 0] + cm_test[0, 1]) > 0 else 0
test_specificity = cm_test[1, 1] / (cm_test[1, 0] + cm_test[1, 1]) if (cm_test[1, 0] + cm_test[1, 1]) > 0 else 0
# Pack metrics
metrics_lr = {
'Model': 'Logistic Regression',
'Train Accuracy': train_acc,
'Test Accuracy': test_acc,
'Train AUC': train_auc,
'Test AUC': test_auc,
'Train Sensitivity': train_sensitivity,
'Test Sensitivity': test_sensitivity,
'Train Specificity': train_specificity,
'Test Specificity': test_specificity
}
# Display metrics
metrics_df_lr = pd.DataFrame([metrics_lr])
print("\nMetrics Summary:")
print(metrics_df_lr.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_proba)
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_proba)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train, tpr_train, label=f'Train (AUC = {train_auc:.3f})', linewidth=2)
plt.plot(fpr_test, tpr_test, label=f'Test (AUC = {test_auc:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Logistic Regression', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_logistic_regression.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_logistic_regression.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_lr = 'models/logistic_regression_best.joblib'
dump(best_lr, model_path_lr)
print(f" Best model saved to: {model_path_lr}")
# Save best params
params_path_lr = 'artifacts/logistic_regression_best_params.json'
with open(params_path_lr, 'w') as f:
json.dump(best_params_lr, f, indent=2)
print(f" Best params saved to: {params_path_lr}")
# Save metrics
metrics_path_lr = 'artifacts/logistic_regression_metrics.json'
with open(metrics_path_lr, 'w') as f:
json.dump(metrics_lr, f, indent=2)
print(f" Metrics saved to: {metrics_path_lr}")
print("\n======= LOGISTIC REGRESSION COMPLETE =======\n")
# Keep references for downstream use (Section 10 leaderboard)
tuned_model_lr = best_lr
metrics_dict_lr = metrics_lr
======= 7.1 LOGISTIC REGRESSION =======

Training Logistic Regression with GridSearchCV...
Fitting 5 folds for each of 48 candidates, totalling 240 fits
c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_validation.py:516: FitFailedWarning:
60 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\base.py", line 1365, in wrapper
return fit_method(estimator, *args, **kwargs)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\linear_model\_logistic.py", line 1218, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\linear_model\_logistic.py", line 77, in _check_solver
raise ValueError("penalty=None is not supported for the liblinear solver")
ValueError: penalty=None is not supported for the liblinear solver
warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_search.py:1135: UserWarning: One or more of the test scores are non-finite: [0.85919831 0.85909336 nan 0.88584905 0.86334137 0.86350074
nan 0.88845927 0.87692632 0.87678989 nan 0.88584905
0.87978998 0.87976784 nan 0.88845927 0.88428289 0.88382906
nan 0.88584905 0.88678479 0.88633144 nan 0.88845927
0.88873783 0.88561236 nan 0.88584905 0.8909964 0.88822755
nan 0.88845927 0.88961479 0.88584205 nan 0.88584905
0.89169817 0.88844443 nan 0.88845927 0.88961366 0.88584879
nan 0.88584905 0.89172642 0.88845975 nan 0.88845927]
warnings.warn(
Best Hyperparameters:
C: 100
class_weight: balanced
penalty: l2
solver: liblinear
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Logistic Regression 0.834529 0.830352 0.893121 0.890529 0.843048 0.842479 0.804091 0.787024
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_logistic_regression.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/logistic_regression_best.joblib
 Best params saved to: artifacts/logistic_regression_best_params.json
 Metrics saved to: artifacts/logistic_regression_metrics.json

======= LOGISTIC REGRESSION COMPLETE =======
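The FitFailedWarning in the log arises because liblinear cannot fit penalty=None. GridSearchCV also accepts a list of parameter dicts, which keeps invalid combinations out of the search entirely; a sketch (param_grid_lr_safe is an illustrative name, not part of the notebook):

```python
from sklearn.model_selection import ParameterGrid

# Each dict restricts itself to mutually compatible settings:
# liblinear/saga both support l2; penalty=None is paired only with saga.
param_grid_lr_safe = [
    {'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'solver': ['liblinear', 'saga'], 'class_weight': [None, 'balanced']},
    {'penalty': [None], 'solver': ['saga'],
     'class_weight': [None, 'balanced']},
]

# Every enumerated candidate is fit-able: 6*2*2 + 2 = 26 combinations,
# versus 48 candidates (12 of them invalid) in the crossed grid above.
n_candidates = len(list(ParameterGrid(param_grid_lr_safe)))
print(f"{n_candidates} valid candidates")
```

The selected configuration (C=100, l2, liblinear, balanced) lives in the first dict, so the search result would be unchanged.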
7.2 Linear Discriminant Analysis (LDA)¶
Purpose & Approach:
- Establishes a linear probabilistic baseline for binary classification using Gaussian class-conditional distributions with shared covariance
- Assumes each class follows a multivariate normal distribution, providing closed-form decision boundaries
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored solver types (svd, lsqr, eigen) and shrinkage parameters to handle potential multicollinearity
- Selected configuration balances computational efficiency with regularization to prevent overfitting
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a parametric probabilistic benchmark to compare against nonparametric and nonlinear methods
- Performance ranked in Section 10 leaderboard by test AUC with generalization gap analysis
# ========== 7.2 LINEAR DISCRIMINANT ANALYSIS (LDA) ==========
print("======= 7.2 LINEAR DISCRIMINANT ANALYSIS (LDA) =======\n")
# Define the model
lda = LinearDiscriminantAnalysis()
# Define hyperparameter grid
# LDA key parameters: solver, and shrinkage (supported only by lsqr/eigen).
# A list of dicts keeps the svd solver from being paired with shrinkage.
param_grid_lda = [
{'solver': ['svd']}, # svd doesn't support shrinkage
{'solver': ['lsqr', 'eigen'], 'shrinkage': [None, 'auto', 0.1, 0.5, 0.9]}
]
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# GridSearchCV
grid_search_lda = GridSearchCV(
estimator=lda,
param_grid=param_grid_lda,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
print("Training LDA with GridSearchCV...")
grid_search_lda.fit(X_train_transformed, y_train)
# Best model
best_lda = grid_search_lda.best_estimator_
best_params_lda = grid_search_lda.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_lda.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_lda = best_lda.predict(X_train_transformed)
y_test_pred_lda = best_lda.predict(X_test_transformed)
y_train_proba_lda = best_lda.predict_proba(X_train_transformed)[:, 1]
y_test_proba_lda = best_lda.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_lda = accuracy_score(y_train, y_train_pred_lda)
test_acc_lda = accuracy_score(y_test, y_test_pred_lda)
# AUC
train_auc_lda = roc_auc_score(y_train, y_train_proba_lda)
test_auc_lda = roc_auc_score(y_test, y_test_proba_lda)
# Confusion matrices for Sensitivity/Specificity
cm_train_lda = confusion_matrix(y_train, y_train_pred_lda)
cm_test_lda = confusion_matrix(y_test, y_test_pred_lda)
# Sensitivity (TPR for class 0): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity (TNR for class 0): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_lda = cm_train_lda[0, 0] / (cm_train_lda[0, 0] + cm_train_lda[0, 1]) if (cm_train_lda[0, 0] + cm_train_lda[0, 1]) > 0 else 0
train_specificity_lda = cm_train_lda[1, 1] / (cm_train_lda[1, 0] + cm_train_lda[1, 1]) if (cm_train_lda[1, 0] + cm_train_lda[1, 1]) > 0 else 0
test_sensitivity_lda = cm_test_lda[0, 0] / (cm_test_lda[0, 0] + cm_test_lda[0, 1]) if (cm_test_lda[0, 0] + cm_test_lda[0, 1]) > 0 else 0
test_specificity_lda = cm_test_lda[1, 1] / (cm_test_lda[1, 0] + cm_test_lda[1, 1]) if (cm_test_lda[1, 0] + cm_test_lda[1, 1]) > 0 else 0
# Pack metrics
metrics_lda = {
'Model': 'Linear Discriminant Analysis',
'Train Accuracy': train_acc_lda,
'Test Accuracy': test_acc_lda,
'Train AUC': train_auc_lda,
'Test AUC': test_auc_lda,
'Train Sensitivity': train_sensitivity_lda,
'Test Sensitivity': test_sensitivity_lda,
'Train Specificity': train_specificity_lda,
'Test Specificity': test_specificity_lda
}
# Display metrics
metrics_df_lda = pd.DataFrame([metrics_lda])
print("\nMetrics Summary:")
print(metrics_df_lda.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_lda, tpr_train_lda, _ = roc_curve(y_train, y_train_proba_lda)
fpr_test_lda, tpr_test_lda, _ = roc_curve(y_test, y_test_proba_lda)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_lda, tpr_train_lda, label=f'Train (AUC = {train_auc_lda:.3f})', linewidth=2)
plt.plot(fpr_test_lda, tpr_test_lda, label=f'Test (AUC = {test_auc_lda:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Linear Discriminant Analysis', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_lda.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_lda.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_lda = 'models/lda_best.joblib'
dump(best_lda, model_path_lda)
print(f" Best model saved to: {model_path_lda}")
# Save best params
params_path_lda = 'artifacts/lda_best_params.json'
with open(params_path_lda, 'w') as f:
json.dump(best_params_lda, f, indent=2)
print(f" Best params saved to: {params_path_lda}")
# Save metrics
metrics_path_lda = 'artifacts/lda_metrics.json'
with open(metrics_path_lda, 'w') as f:
json.dump(metrics_lda, f, indent=2)
print(f" Metrics saved to: {metrics_path_lda}")
print("\n======= LINEAR DISCRIMINANT ANALYSIS COMPLETE =======\n")
# Keep references for downstream use (Section 10 leaderboard)
tuned_model_lda = best_lda
metrics_dict_lda = metrics_lda
======= 7.2 LINEAR DISCRIMINANT ANALYSIS (LDA) =======

Training LDA with GridSearchCV...
Fitting 5 folds for each of 11 candidates, totalling 55 fits
c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_validation.py:516: FitFailedWarning:
5 fits failed out of a total of 55.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\base.py", line 1365, in wrapper
return fit_method(estimator, *args, **kwargs)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\discriminant_analysis.py", line 716, in fit
self._solve_eigen(
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\discriminant_analysis.py", line 549, in _solve_eigen
evals, evecs = linalg.eigh(Sb, Sw)
File "c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\scipy\linalg\_decomp.py", line 592, in eigh
raise LinAlgError(f'The leading minor of order {info-n} of B is not '
numpy.linalg.LinAlgError: The leading minor of order 19 of B is not positive definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
--------------------------------------------------------------------------------
(The remaining 3 fits failed with the same LinAlgError, reporting leading
minors of order 33, 25, and 32 respectively.)
warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\model_selection\_search.py:1135: UserWarning: One or more of the test scores are non-finite: [0.88398251 0.88399334 nan 0.87715674 0.87715674 0.87432636
0.87432636 0.85749255 0.85749255 0.82999991 0.82999991]
warnings.warn(
Best Hyperparameters:
shrinkage: None
solver: lsqr
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Linear Discriminant Analysis 0.87367 0.872764 0.884986 0.883134 0.944771 0.940979 0.619644 0.629055
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_lda.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/lda_best.joblib
 Best params saved to: artifacts/lda_best_params.json
 Metrics saved to: artifacts/lda_metrics.json

======= LINEAR DISCRIMINANT ANALYSIS COMPLETE =======
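The eigen-solver failures in the log likely trace to a singular within-class scatter matrix (collinear one-hot columns make it rank-deficient), which is exactly what shrinkage guards against. A minimal sketch of the shrunk-covariance formula sklearn documents for the lsqr/eigen solvers, Sigma(alpha) = (1 - alpha) * Sigma_hat + alpha * (trace(Sigma_hat)/p) * I, on synthetic data with a near-duplicate column:

```python
import numpy as np

# Synthetic design with one near-duplicate column, so the empirical
# covariance is close to singular (huge condition number).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 3] + 1e-8 * rng.normal(size=200)
sigma = np.cov(X, rowvar=False)
p = sigma.shape[0]

def shrunk(sigma, alpha):
    # Blend the empirical covariance with a scaled identity.
    return (1 - alpha) * sigma + alpha * (np.trace(sigma) / p) * np.eye(p)

conds = [np.linalg.cond(shrunk(sigma, a)) for a in (0.0, 0.1, 0.5)]
for a, c in zip((0.0, 0.1, 0.5), conds):
    print(f"alpha={a}: condition number = {c:.3g}")
```

The condition number collapses as alpha grows, which is why even a small shrinkage value keeps the lsqr/eigen solvers numerically stable.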
7.3 Quadratic Discriminant Analysis (QDA)¶
Purpose & Approach:
- Extends LDA by allowing class-specific covariance matrices to model nonlinear decision boundaries between classes
- Assumes each class follows a multivariate Gaussian distribution but relaxes the shared covariance assumption
- More flexible than LDA for datasets where classes have different variance structures
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored regularization parameter (
reg_param) to prevent overfitting from estimating separate covariance matrices - Selected configuration balances flexibility with stability to avoid singular covariance estimates
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a nonlinear probabilistic benchmark to assess whether relaxing the shared covariance assumption improves over LDA
- Performance ranked in Section 10 leaderboard by test AUC with generalization gap analysis
# ========== 7.3 QUADRATIC DISCRIMINANT ANALYSIS (QDA) ==========
print("======= 7.3 QUADRATIC DISCRIMINANT ANALYSIS (QDA) =======\n")
# Define the model
qda = QuadraticDiscriminantAnalysis()
# Define hyperparameter grid
# QDA key parameter: reg_param (regularization)
param_grid_qda = {
'reg_param': [0.0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# GridSearchCV
grid_search_qda = GridSearchCV(
estimator=qda,
param_grid=param_grid_qda,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
print("Training QDA with GridSearchCV...")
grid_search_qda.fit(X_train_transformed, y_train)
# Best model
best_qda = grid_search_qda.best_estimator_
best_params_qda = grid_search_qda.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_qda.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_qda = best_qda.predict(X_train_transformed)
y_test_pred_qda = best_qda.predict(X_test_transformed)
y_train_proba_qda = best_qda.predict_proba(X_train_transformed)[:, 1]
y_test_proba_qda = best_qda.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_qda = accuracy_score(y_train, y_train_pred_qda)
test_acc_qda = accuracy_score(y_test, y_test_pred_qda)
# AUC
train_auc_qda = roc_auc_score(y_train, y_train_proba_qda)
test_auc_qda = roc_auc_score(y_test, y_test_proba_qda)
# Confusion matrices for Sensitivity/Specificity
cm_train_qda = confusion_matrix(y_train, y_train_pred_qda)
cm_test_qda = confusion_matrix(y_test, y_test_pred_qda)
# Sensitivity (TPR for class 0): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity (TNR for class 0): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_qda = cm_train_qda[0, 0] / (cm_train_qda[0, 0] + cm_train_qda[0, 1]) if (cm_train_qda[0, 0] + cm_train_qda[0, 1]) > 0 else 0
train_specificity_qda = cm_train_qda[1, 1] / (cm_train_qda[1, 0] + cm_train_qda[1, 1]) if (cm_train_qda[1, 0] + cm_train_qda[1, 1]) > 0 else 0
test_sensitivity_qda = cm_test_qda[0, 0] / (cm_test_qda[0, 0] + cm_test_qda[0, 1]) if (cm_test_qda[0, 0] + cm_test_qda[0, 1]) > 0 else 0
test_specificity_qda = cm_test_qda[1, 1] / (cm_test_qda[1, 0] + cm_test_qda[1, 1]) if (cm_test_qda[1, 0] + cm_test_qda[1, 1]) > 0 else 0
# Pack metrics
metrics_qda = {
'Model': 'Quadratic Discriminant Analysis',
'Train Accuracy': train_acc_qda,
'Test Accuracy': test_acc_qda,
'Train AUC': train_auc_qda,
'Test AUC': test_auc_qda,
'Train Sensitivity': train_sensitivity_qda,
'Test Sensitivity': test_sensitivity_qda,
'Train Specificity': train_specificity_qda,
'Test Specificity': test_specificity_qda
}
# Display metrics
metrics_df_qda = pd.DataFrame([metrics_qda])
print("\nMetrics Summary:")
print(metrics_df_qda.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_qda, tpr_train_qda, _ = roc_curve(y_train, y_train_proba_qda)
fpr_test_qda, tpr_test_qda, _ = roc_curve(y_test, y_test_proba_qda)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_qda, tpr_train_qda, label=f'Train (AUC = {train_auc_qda:.3f})', linewidth=2)
plt.plot(fpr_test_qda, tpr_test_qda, label=f'Test (AUC = {test_auc_qda:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Quadratic Discriminant Analysis', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_qda.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_qda.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_qda = 'models/qda_best.joblib'
dump(best_qda, model_path_qda)
print(f" Best model saved to: {model_path_qda}")
# Save best params
params_path_qda = 'artifacts/qda_best_params.json'
with open(params_path_qda, 'w') as f:
json.dump(best_params_qda, f, indent=2)
print(f" Best params saved to: {params_path_qda}")
# Save metrics
metrics_path_qda = 'artifacts/qda_metrics.json'
with open(metrics_path_qda, 'w') as f:
json.dump(metrics_qda, f, indent=2)
print(f" Metrics saved to: {metrics_path_qda}")
print("\n======= QUADRATIC DISCRIMINANT ANALYSIS COMPLETE =======\n")
# Keep references for downstream use (Section 10 leaderboard)
tuned_model_qda = best_qda
metrics_dict_qda = metrics_qda
======= 7.3 QUADRATIC DISCRIMINANT ANALYSIS (QDA) =======
Training QDA with GridSearchCV...
Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best Hyperparameters:
reg_param: 0.05
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Quadratic Discriminant Analysis 0.862795 0.866595 0.882744 0.879489 0.902769 0.907027 0.719979 0.722144
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_qda.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/qda_best.joblib
 Best params saved to: artifacts/qda_best_params.json
 Metrics saved to: artifacts/qda_metrics.json

======= QUADRATIC DISCRIMINANT ANALYSIS COMPLETE =======
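sklearn documents reg_param as blending each per-class covariance with the identity, Sigma_k -> (1 - reg_param) * Sigma_k + reg_param * I, so a small value like the selected 0.05 guards against near-singular class estimates without erasing the class-specific variance structure that is QDA's advantage. A small synthetic sketch (illustrative data, not the project pipeline):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Two Gaussian classes with different means AND different variances,
# the setting where QDA's class-specific covariances pay off.
rng = np.random.default_rng(1)
n = 400
X0 = rng.normal(0.0, 1.0, size=(n, 4))   # class 0: unit variance
X1 = rng.normal(1.0, 2.0, size=(n, 4))   # class 1: larger variance
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

accs = []
for r in (0.0, 0.05, 0.5):
    qda = QuadraticDiscriminantAnalysis(reg_param=r).fit(X, y)
    accs.append(qda.score(X, y))
    print(f"reg_param={r}: train accuracy = {accs[-1]:.3f}")
```

Pushing reg_param toward 1 shrinks both class covariances toward the same identity matrix, so the decision boundary drifts back toward a shared-covariance (LDA-like) shape.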
7.4 Gaussian Naive Bayes¶
Purpose & Approach:
- Implements a probabilistic baseline using Gaussian Naive Bayes, which assumes feature independence and models each class with a Gaussian distribution
- Serves as a computationally efficient benchmark to assess whether feature independence assumptions hold for credit risk prediction
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored the variance smoothing parameter (var_smoothing) across log-spaced values from 1e-12 to 1e-6 to stabilize probability estimates
- Selected configuration balances numerical stability with model flexibility
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a fast probabilistic benchmark to compare against models that capture feature dependencies
- Performance ranked in Section 10 leaderboard by test AUC, with analysis of whether independence assumptions penalize predictive power
# ========== 7.4 NAIVE BAYES (GAUSSIAN) ==========
print("======= 7.4 NAIVE BAYES (GAUSSIAN) =======\n")
# Define the model
gnb = GaussianNB()
# Define hyperparameter grid
# GaussianNB key parameter: var_smoothing (portion of largest variance added to all variances)
param_grid_gnb = {
'var_smoothing': np.logspace(-12, -6, 20) # Log-spaced values from 1e-12 to 1e-6
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# GridSearchCV
grid_search_gnb = GridSearchCV(
estimator=gnb,
param_grid=param_grid_gnb,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
print("Training Gaussian Naive Bayes with GridSearchCV...")
grid_search_gnb.fit(X_train_transformed, y_train)
# Best model
best_gnb = grid_search_gnb.best_estimator_
best_params_gnb = grid_search_gnb.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_gnb.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_gnb = best_gnb.predict(X_train_transformed)
y_test_pred_gnb = best_gnb.predict(X_test_transformed)
y_train_proba_gnb = best_gnb.predict_proba(X_train_transformed)[:, 1]
y_test_proba_gnb = best_gnb.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_gnb = accuracy_score(y_train, y_train_pred_gnb)
test_acc_gnb = accuracy_score(y_test, y_test_pred_gnb)
# AUC
train_auc_gnb = roc_auc_score(y_train, y_train_proba_gnb)
test_auc_gnb = roc_auc_score(y_test, y_test_proba_gnb)
# Confusion matrices for Sensitivity/Specificity
cm_train_gnb = confusion_matrix(y_train, y_train_pred_gnb)
cm_test_gnb = confusion_matrix(y_test, y_test_pred_gnb)
# Sensitivity (TPR for class 0): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity (TNR for class 0): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_gnb = cm_train_gnb[0, 0] / (cm_train_gnb[0, 0] + cm_train_gnb[0, 1]) if (cm_train_gnb[0, 0] + cm_train_gnb[0, 1]) > 0 else 0
train_specificity_gnb = cm_train_gnb[1, 1] / (cm_train_gnb[1, 0] + cm_train_gnb[1, 1]) if (cm_train_gnb[1, 0] + cm_train_gnb[1, 1]) > 0 else 0
test_sensitivity_gnb = cm_test_gnb[0, 0] / (cm_test_gnb[0, 0] + cm_test_gnb[0, 1]) if (cm_test_gnb[0, 0] + cm_test_gnb[0, 1]) > 0 else 0
test_specificity_gnb = cm_test_gnb[1, 1] / (cm_test_gnb[1, 0] + cm_test_gnb[1, 1]) if (cm_test_gnb[1, 0] + cm_test_gnb[1, 1]) > 0 else 0
# Pack metrics
metrics_gnb = {
'Model': 'Gaussian Naive Bayes',
'Train Accuracy': train_acc_gnb,
'Test Accuracy': test_acc_gnb,
'Train AUC': train_auc_gnb,
'Test AUC': test_auc_gnb,
'Train Sensitivity': train_sensitivity_gnb,
'Test Sensitivity': test_sensitivity_gnb,
'Train Specificity': train_specificity_gnb,
'Test Specificity': test_specificity_gnb
}
# Display metrics
metrics_df_gnb = pd.DataFrame([metrics_gnb])
print("\nMetrics Summary:")
print(metrics_df_gnb.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_gnb, tpr_train_gnb, _ = roc_curve(y_train, y_train_proba_gnb)
fpr_test_gnb, tpr_test_gnb, _ = roc_curve(y_test, y_test_proba_gnb)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_gnb, tpr_train_gnb, label=f'Train (AUC = {train_auc_gnb:.3f})', linewidth=2)
plt.plot(fpr_test_gnb, tpr_test_gnb, label=f'Test (AUC = {test_auc_gnb:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Gaussian Naive Bayes', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_naive_bayes.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_naive_bayes.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_gnb = 'models/naive_bayes_best.joblib'
dump(best_gnb, model_path_gnb)
print(f" Best model saved to: {model_path_gnb}")
# Save best params
params_path_gnb = 'artifacts/naive_bayes_best_params.json'
with open(params_path_gnb, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {k: float(v) if isinstance(v, np.floating) else v
for k, v in best_params_gnb.items()}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_gnb}")
# Save metrics
metrics_path_gnb = 'artifacts/naive_bayes_metrics.json'
with open(metrics_path_gnb, 'w') as f:
json.dump(metrics_gnb, f, indent=2)
print(f" Metrics saved to: {metrics_path_gnb}")
print("\n======= GAUSSIAN NAIVE BAYES COMPLETE =======\n")
# Return for potential downstream use
tuned_model_gnb = best_gnb
metrics_dict_gnb = metrics_gnb
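The sensitivity/specificity ternaries above are repeated verbatim for every model in Section 7. A small helper could replace them; this is a suggested refactor, not part of the original notebook, and the name `class_recalls` is ours:

```python
import numpy as np

def class_recalls(cm):
    """Return (recall for class 0, recall for class 1) from a 2x2 confusion matrix.

    Row i of cm holds the true-class-i counts, so recall for class i is
    cm[i, i] / row_sum(i). Rows that sum to zero yield 0.0 instead of NaN.
    """
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        recalls = np.where(row_sums > 0, np.diag(cm) / row_sums, 0.0)
    return float(recalls[0]), float(recalls[1])

# Example: the same quantities each model block computes by hand
cm = np.array([[90, 10],
               [20, 30]])
sensitivity, specificity = class_recalls(cm)  # 0.9, 0.6
```

Each per-model block would then reduce to two calls, e.g. `train_sensitivity_gnb, train_specificity_gnb = class_recalls(cm_train_gnb)`.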
======= 7.4 NAIVE BAYES (GAUSSIAN) =======

Training Gaussian Naive Bayes with GridSearchCV...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Hyperparameters:
var_smoothing: 2.6366508987303555e-08
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Gaussian Naive Bayes 0.845712 0.845774 0.852637 0.849821 0.930211 0.930715 0.543819 0.542313
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_naive_bayes.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/naive_bayes_best.joblib
 Best params saved to: artifacts/naive_bayes_best_params.json
 Metrics saved to: artifacts/naive_bayes_metrics.json

======= GAUSSIAN NAIVE BAYES COMPLETE =======
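Each model block converts NumPy scalars by hand before `json.dump`. The `default=` hook of the `json` module can do this once for all blocks; a sketch (the helper name `to_jsonable` is ours):

```python
import json
import numpy as np

def to_jsonable(obj):
    """json.dump fallback: convert NumPy scalars/arrays to native Python types."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# json calls to_jsonable only for values it cannot serialize natively
params = {"var_smoothing": np.float64(2.64e-08), "n_neighbors": np.int64(31)}
text = json.dumps(params, indent=2, default=to_jsonable)
```

With this, `json.dump(best_params_x, f, indent=2, default=to_jsonable)` would replace the ad-hoc dict comprehensions scattered through the section.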
7.5 k-Nearest Neighbors (KNN)¶
Purpose & Approach:
- Implements a non-parametric instance-based classifier that predicts loan default by finding the k most similar training samples in transformed feature space
- Serves as a local learning benchmark to assess whether distance-based methods can capture credit risk patterns without explicit model parameters
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored number of neighbors (`n_neighbors`), weighting schemes (`uniform`, `distance`), and distance metrics (`euclidean`, `manhattan`, `minkowski`)
- Selected configuration balances the bias-variance tradeoff and computational efficiency for large-scale credit data
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a non-parametric baseline to compare against parametric and ensemble methods
- Performance ranked in Section 10 leaderboard by test AUC, with analysis of potential overfitting due to memorization of training instances
# ========== 7.5 K-NEAREST NEIGHBORS (KNN) ==========
print("======= 7.5 K-NEAREST NEIGHBORS (KNN) =======\n")
# Define the model
knn = KNeighborsClassifier()
# Define hyperparameter grid
# KNN key parameters: n_neighbors, weights, metric
param_grid_knn = {
'n_neighbors': [3, 5, 7, 9, 11, 15, 21, 31],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski']
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# GridSearchCV
grid_search_knn = GridSearchCV(
estimator=knn,
param_grid=param_grid_knn,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
print("Training KNN with GridSearchCV...")
grid_search_knn.fit(X_train_transformed, y_train)
# Best model
best_knn = grid_search_knn.best_estimator_
best_params_knn = grid_search_knn.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_knn.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_knn = best_knn.predict(X_train_transformed)
y_test_pred_knn = best_knn.predict(X_test_transformed)
y_train_proba_knn = best_knn.predict_proba(X_train_transformed)[:, 1]
y_test_proba_knn = best_knn.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_knn = accuracy_score(y_train, y_train_pred_knn)
test_acc_knn = accuracy_score(y_test, y_test_pred_knn)
# AUC
train_auc_knn = roc_auc_score(y_train, y_train_proba_knn)
test_auc_knn = roc_auc_score(y_test, y_test_proba_knn)
# Confusion matrices for Sensitivity/Specificity
cm_train_knn = confusion_matrix(y_train, y_train_pred_knn)
cm_test_knn = confusion_matrix(y_test, y_test_pred_knn)
# Sensitivity = recall for class 0 (non-default): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity = recall for class 1 (default): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_knn = cm_train_knn[0, 0] / (cm_train_knn[0, 0] + cm_train_knn[0, 1]) if (cm_train_knn[0, 0] + cm_train_knn[0, 1]) > 0 else 0
train_specificity_knn = cm_train_knn[1, 1] / (cm_train_knn[1, 0] + cm_train_knn[1, 1]) if (cm_train_knn[1, 0] + cm_train_knn[1, 1]) > 0 else 0
test_sensitivity_knn = cm_test_knn[0, 0] / (cm_test_knn[0, 0] + cm_test_knn[0, 1]) if (cm_test_knn[0, 0] + cm_test_knn[0, 1]) > 0 else 0
test_specificity_knn = cm_test_knn[1, 1] / (cm_test_knn[1, 0] + cm_test_knn[1, 1]) if (cm_test_knn[1, 0] + cm_test_knn[1, 1]) > 0 else 0
# Pack metrics
metrics_knn = {
'Model': 'K-Nearest Neighbors',
'Train Accuracy': train_acc_knn,
'Test Accuracy': test_acc_knn,
'Train AUC': train_auc_knn,
'Test AUC': test_auc_knn,
'Train Sensitivity': train_sensitivity_knn,
'Test Sensitivity': test_sensitivity_knn,
'Train Specificity': train_specificity_knn,
'Test Specificity': test_specificity_knn
}
# Display metrics
metrics_df_knn = pd.DataFrame([metrics_knn])
print("\nMetrics Summary:")
print(metrics_df_knn.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_knn, tpr_train_knn, _ = roc_curve(y_train, y_train_proba_knn)
fpr_test_knn, tpr_test_knn, _ = roc_curve(y_test, y_test_proba_knn)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_knn, tpr_train_knn, label=f'Train (AUC = {train_auc_knn:.3f})', linewidth=2)
plt.plot(fpr_test_knn, tpr_test_knn, label=f'Test (AUC = {test_auc_knn:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — K-Nearest Neighbors', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_knn.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_knn.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_knn = 'models/knn_best.joblib'
dump(best_knn, model_path_knn)
print(f" Best model saved to: {model_path_knn}")
# Save best params
params_path_knn = 'artifacts/knn_best_params.json'
with open(params_path_knn, 'w') as f:
json.dump(best_params_knn, f, indent=2)
print(f" Best params saved to: {params_path_knn}")
# Save metrics
metrics_path_knn = 'artifacts/knn_metrics.json'
with open(metrics_path_knn, 'w') as f:
json.dump(metrics_knn, f, indent=2)
print(f" Metrics saved to: {metrics_path_knn}")
print("\n======= K-NEAREST NEIGHBORS COMPLETE =======\n")
# Return for potential downstream use
tuned_model_knn = best_knn
metrics_dict_knn = metrics_knn
======= 7.5 K-NEAREST NEIGHBORS (KNN) =======
Training KNN with GridSearchCV...
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Hyperparameters:
metric: manhattan
n_neighbors: 31
weights: distance
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
K-Nearest Neighbors 1.0 0.885873 1.0 0.894739 1.0 0.985393 1.0 0.530324
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_knn.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/knn_best.joblib
 Best params saved to: artifacts/knn_best_params.json
 Metrics saved to: artifacts/knn_metrics.json

======= K-NEAREST NEIGHBORS COMPLETE =======
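The perfect train-set scores above are an artifact of `weights='distance'`: when a training point is scored, it is its own nearest neighbor at distance zero, and scikit-learn assigns full weight to zero-distance neighbors, so the point's own label dominates the vote. A minimal demonstration on synthetic data (not the project pipeline):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)  # labels are pure noise

# Distance weighting memorizes the training set: each point's own
# zero-distance "neighbor" outvotes the other 30.
knn = KNeighborsClassifier(n_neighbors=31, weights="distance").fit(X, y)
train_acc = accuracy_score(y, knn.predict(X))  # 1.0 even on random labels
```

This is why the Section 10 leaderboard should weight the test AUC (0.895) rather than the train AUC (1.0) when ranking KNN.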
7.6 Decision Tree (CART)¶
Purpose & Approach:
- Implements a single decision tree classifier using the Classification and Regression Trees (CART) algorithm with recursive binary splitting
- Serves as an interpretable baseline for tree-based methods, providing transparent decision rules that can be visualized and explained
- Prone to overfitting without regularization, making it useful for assessing the value of ensemble methods
Hyperparameter Tuning:
- Tuned via cost-complexity pruning using `ccp_alpha` values derived from the pruning path of a full tree
- Explored tree depth (`max_depth`), minimum samples for splits/leaves, and class weighting using RandomizedSearchCV with 50 parameter combinations
- Selected configuration balances model complexity with generalization to prevent overfitting on training data
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a tree-based baseline to quantify the improvement gained from ensemble methods (Bagging, Random Forest, Boosting)
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis to assess regularization effectiveness
# ========== 7.6 DECISION TREE (CART) ==========
print("======= 7.6 DECISION TREE (CART) =======\n")
# Define the model
dt = DecisionTreeClassifier(random_state=RANDOM_STATE)
# ========== PRUNING PATH APPROACH ==========
# First, fit a full tree to get the pruning path
print("Computing cost-complexity pruning path...")
full_tree = DecisionTreeClassifier(random_state=RANDOM_STATE)
full_tree.fit(X_train_transformed, y_train)
path = full_tree.cost_complexity_pruning_path(X_train_transformed, y_train)
ccp_alphas = path.ccp_alphas
# Filter out extreme values (too small or too large)
# Use alphas from the middle range for tuning
ccp_alphas_filtered = ccp_alphas[(ccp_alphas > 0) & (ccp_alphas < ccp_alphas.max())]
# Sample evenly from the filtered range to avoid too many candidates
if len(ccp_alphas_filtered) > 20:
# Take every nth element to reduce search space
step = len(ccp_alphas_filtered) // 20
ccp_alphas_grid = ccp_alphas_filtered[::step]
else:
ccp_alphas_grid = ccp_alphas_filtered
print(f"Selected {len(ccp_alphas_grid)} ccp_alpha values for tuning")
# Define hyperparameter grid
param_grid_dt = {
'ccp_alpha': ccp_alphas_grid.tolist() + [0.0], # Include 0.0 for unpruned tree
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 10, 20],
'min_samples_leaf': [1, 5, 10],
'class_weight': [None, 'balanced']
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# Use RandomizedSearchCV for efficiency (Decision Trees have many parameters)
print("Training Decision Tree with RandomizedSearchCV...")
grid_search_dt = RandomizedSearchCV(
estimator=dt,
param_distributions=param_grid_dt,
n_iter=50, # Sample 50 combinations
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
grid_search_dt.fit(X_train_transformed, y_train)
# Best model
best_dt = grid_search_dt.best_estimator_
best_params_dt = grid_search_dt.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_dt.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_dt = best_dt.predict(X_train_transformed)
y_test_pred_dt = best_dt.predict(X_test_transformed)
y_train_proba_dt = best_dt.predict_proba(X_train_transformed)[:, 1]
y_test_proba_dt = best_dt.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_dt = accuracy_score(y_train, y_train_pred_dt)
test_acc_dt = accuracy_score(y_test, y_test_pred_dt)
# AUC
train_auc_dt = roc_auc_score(y_train, y_train_proba_dt)
test_auc_dt = roc_auc_score(y_test, y_test_proba_dt)
# Confusion matrices for Sensitivity/Specificity
cm_train_dt = confusion_matrix(y_train, y_train_pred_dt)
cm_test_dt = confusion_matrix(y_test, y_test_pred_dt)
# Sensitivity = recall for class 0 (non-default): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity = recall for class 1 (default): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_dt = cm_train_dt[0, 0] / (cm_train_dt[0, 0] + cm_train_dt[0, 1]) if (cm_train_dt[0, 0] + cm_train_dt[0, 1]) > 0 else 0
train_specificity_dt = cm_train_dt[1, 1] / (cm_train_dt[1, 0] + cm_train_dt[1, 1]) if (cm_train_dt[1, 0] + cm_train_dt[1, 1]) > 0 else 0
test_sensitivity_dt = cm_test_dt[0, 0] / (cm_test_dt[0, 0] + cm_test_dt[0, 1]) if (cm_test_dt[0, 0] + cm_test_dt[0, 1]) > 0 else 0
test_specificity_dt = cm_test_dt[1, 1] / (cm_test_dt[1, 0] + cm_test_dt[1, 1]) if (cm_test_dt[1, 0] + cm_test_dt[1, 1]) > 0 else 0
# Pack metrics
metrics_dt = {
'Model': 'Decision Tree',
'Train Accuracy': train_acc_dt,
'Test Accuracy': test_acc_dt,
'Train AUC': train_auc_dt,
'Test AUC': test_auc_dt,
'Train Sensitivity': train_sensitivity_dt,
'Test Sensitivity': test_sensitivity_dt,
'Train Specificity': train_specificity_dt,
'Test Specificity': test_specificity_dt
}
# Display metrics
metrics_df_dt = pd.DataFrame([metrics_dt])
print("\nMetrics Summary:")
print(metrics_df_dt.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_dt, tpr_train_dt, _ = roc_curve(y_train, y_train_proba_dt)
fpr_test_dt, tpr_test_dt, _ = roc_curve(y_test, y_test_proba_dt)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_dt, tpr_train_dt, label=f'Train (AUC = {train_auc_dt:.3f})', linewidth=2)
plt.plot(fpr_test_dt, tpr_test_dt, label=f'Test (AUC = {test_auc_dt:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Decision Tree', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_decision_tree.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_decision_tree.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_dt = 'models/decision_tree_best.joblib'
dump(best_dt, model_path_dt)
print(f" Best model saved to: {model_path_dt}")
# Save best params
params_path_dt = 'artifacts/decision_tree_best_params.json'
with open(params_path_dt, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {k: float(v) if isinstance(v, (np.floating, np.integer)) else v
for k, v in best_params_dt.items()}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_dt}")
# Save metrics
metrics_path_dt = 'artifacts/decision_tree_metrics.json'
with open(metrics_path_dt, 'w') as f:
json.dump(metrics_dt, f, indent=2)
print(f" Metrics saved to: {metrics_path_dt}")
print("\n======= DECISION TREE COMPLETE =======\n")
# Return for potential downstream use
tuned_model_dt = best_dt
metrics_dict_dt = metrics_dt
======= 7.6 DECISION TREE (CART) =======
Computing cost-complexity pruning path...
Selected 21 ccp_alpha values for tuning
Training Decision Tree with RandomizedSearchCV...
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Hyperparameters:
min_samples_split: 20
min_samples_leaf: 5
max_depth: None
class_weight: None
ccp_alpha: 0.0001187232033798559
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Decision Tree 0.937876 0.92736 0.930192 0.911335 0.991906 0.984998 0.744842 0.721439
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_decision_tree.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/decision_tree_best.joblib
 Best params saved to: artifacts/decision_tree_best_params.json
 Metrics saved to: artifacts/decision_tree_metrics.json

======= DECISION TREE COMPLETE =======
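The overfitting gap analysis mentioned in the bullets above can be computed directly as train AUC minus test AUC. A sketch using the figures reported so far in this section (the dict literal stands in for the `metrics_gnb`/`metrics_knn`/`metrics_dt` dicts the notebook already builds):

```python
# Train-test AUC gap as a simple overfitting indicator (smaller is better).
reported = {
    "Gaussian Naive Bayes": (0.852637, 0.849821),
    "K-Nearest Neighbors":  (1.000000, 0.894739),
    "Decision Tree":        (0.930192, 0.911335),
}
gaps = {name: round(train - test, 6) for name, (train, test) in reported.items()}
# KNN shows by far the largest gap, consistent with distance-weighted
# memorization; the pruned tree sits between KNN and Naive Bayes.
```

Section 10 could rank models on test AUC while flagging any model whose gap exceeds a chosen tolerance.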
7.7 Bagging (Bootstrap Aggregation)¶
Purpose & Approach:
- Implements an ensemble method that reduces variance by training multiple decision trees on bootstrapped samples of the training data and averaging their predictions
- Serves as a variance reduction benchmark to assess whether aggregating independent weak learners improves over a single decision tree's performance
- Addresses overfitting by introducing diversity through random sampling of both training instances and features
Hyperparameter Tuning:
- Tuned via 5-fold stratified RandomizedSearchCV optimizing ROC-AUC with 50 parameter combinations
- Explored ensemble size (`n_estimators`), bootstrap sampling rates (`max_samples`, `max_features`), feature/instance bootstrapping strategies, and base tree complexity (depth, split criteria, class weighting)
- Selected configuration balances ensemble diversity with individual tree quality to maximize generalization
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a pure ensemble baseline to quantify variance reduction gains over single decision trees and to compare against Random Forest (which adds feature randomization) and boosting methods (which use sequential learning)
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis to evaluate ensemble stability
# ========== 7.7 BAGGING (DECISION TREE BASE) ==========
print("======= 7.7 BAGGING =======\n")
# Define the model
# BaggingClassifier with DecisionTreeClassifier as base estimator
bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
random_state=RANDOM_STATE
)
# Define hyperparameter grid
# Focus on n_estimators, max_samples, max_features, and bootstrap
param_distributions_bagging = {
'n_estimators': [10, 50, 100, 200],
'max_samples': [0.5, 0.7, 0.9, 1.0],
'max_features': [0.5, 0.7, 0.9, 1.0],
'bootstrap': [True, False],
'bootstrap_features': [False, True],
# Base estimator parameters
'estimator__max_depth': [None, 10, 20, 30],
'estimator__min_samples_split': [2, 5, 10],
'estimator__min_samples_leaf': [1, 2, 5],
'estimator__class_weight': [None, 'balanced']
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# Use RandomizedSearchCV for efficiency (many parameter combinations)
print("Training Bagging with RandomizedSearchCV...")
random_search_bagging = RandomizedSearchCV(
estimator=bagging,
param_distributions=param_distributions_bagging,
n_iter=50, # Sample 50 combinations
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_bagging.fit(X_train_transformed, y_train)
# Best model
best_bagging = random_search_bagging.best_estimator_
best_params_bagging = random_search_bagging.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_bagging.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_bagging = best_bagging.predict(X_train_transformed)
y_test_pred_bagging = best_bagging.predict(X_test_transformed)
y_train_proba_bagging = best_bagging.predict_proba(X_train_transformed)[:, 1]
y_test_proba_bagging = best_bagging.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_bagging = accuracy_score(y_train, y_train_pred_bagging)
test_acc_bagging = accuracy_score(y_test, y_test_pred_bagging)
# AUC
train_auc_bagging = roc_auc_score(y_train, y_train_proba_bagging)
test_auc_bagging = roc_auc_score(y_test, y_test_proba_bagging)
# Confusion matrices for Sensitivity/Specificity
cm_train_bagging = confusion_matrix(y_train, y_train_pred_bagging)
cm_test_bagging = confusion_matrix(y_test, y_test_pred_bagging)
# Sensitivity = recall for class 0 (non-default): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity = recall for class 1 (default): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_bagging = cm_train_bagging[0, 0] / (cm_train_bagging[0, 0] + cm_train_bagging[0, 1]) if (cm_train_bagging[0, 0] + cm_train_bagging[0, 1]) > 0 else 0
train_specificity_bagging = cm_train_bagging[1, 1] / (cm_train_bagging[1, 0] + cm_train_bagging[1, 1]) if (cm_train_bagging[1, 0] + cm_train_bagging[1, 1]) > 0 else 0
test_sensitivity_bagging = cm_test_bagging[0, 0] / (cm_test_bagging[0, 0] + cm_test_bagging[0, 1]) if (cm_test_bagging[0, 0] + cm_test_bagging[0, 1]) > 0 else 0
test_specificity_bagging = cm_test_bagging[1, 1] / (cm_test_bagging[1, 0] + cm_test_bagging[1, 1]) if (cm_test_bagging[1, 0] + cm_test_bagging[1, 1]) > 0 else 0
# Pack metrics
metrics_bagging = {
'Model': 'Bagging',
'Train Accuracy': train_acc_bagging,
'Test Accuracy': test_acc_bagging,
'Train AUC': train_auc_bagging,
'Test AUC': test_auc_bagging,
'Train Sensitivity': train_sensitivity_bagging,
'Test Sensitivity': test_sensitivity_bagging,
'Train Specificity': train_specificity_bagging,
'Test Specificity': test_specificity_bagging
}
# Display metrics
metrics_df_bagging = pd.DataFrame([metrics_bagging])
print("\nMetrics Summary:")
print(metrics_df_bagging.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_bagging, tpr_train_bagging, _ = roc_curve(y_train, y_train_proba_bagging)
fpr_test_bagging, tpr_test_bagging, _ = roc_curve(y_test, y_test_proba_bagging)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_bagging, tpr_train_bagging, label=f'Train (AUC = {train_auc_bagging:.3f})', linewidth=2)
plt.plot(fpr_test_bagging, tpr_test_bagging, label=f'Test (AUC = {test_auc_bagging:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Bagging', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_bagging.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_bagging.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_bagging = 'models/bagging_best.joblib'
dump(best_bagging, model_path_bagging)
print(f" Best model saved to: {model_path_bagging}")
# Save best params
params_path_bagging = 'artifacts/bagging_best_params.json'
with open(params_path_bagging, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {k: float(v) if isinstance(v, (np.floating, np.integer)) else v
for k, v in best_params_bagging.items()}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_bagging}")
# Save metrics
metrics_path_bagging = 'artifacts/bagging_metrics.json'
with open(metrics_path_bagging, 'w') as f:
json.dump(metrics_bagging, f, indent=2)
print(f" Metrics saved to: {metrics_path_bagging}")
print("\n======= BAGGING COMPLETE =======\n")
# Return for potential downstream use
tuned_model_bagging = best_bagging
metrics_dict_bagging = metrics_bagging
======= 7.7 BAGGING =======

Training Bagging with RandomizedSearchCV...
Fitting 5 folds for each of 50 candidates, totalling 250 fits

 Best Hyperparameters:
   n_estimators: 200
   max_samples: 0.9
   max_features: 0.9
   estimator__min_samples_split: 2
   estimator__min_samples_leaf: 2
   estimator__max_depth: 30
   estimator__class_weight: None
   bootstrap_features: True
   bootstrap: False

==================================================
COMPUTING METRICS
==================================================

Metrics Summary:
  Model  Train Accuracy  Test Accuracy  Train AUC  Test AUC  Train Sensitivity  Test Sensitivity  Train Specificity  Test Specificity
Bagging        0.999807       0.934762        1.0  0.937623                1.0          0.992104           0.999118          0.729901

==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_bagging.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/bagging_best.joblib
 Best params saved to: artifacts/bagging_best_params.json
 Metrics saved to: artifacts/bagging_metrics.json

======= BAGGING COMPLETE =======
7.8 Random Forest¶
Purpose & Approach:
- Implements an ensemble of decision trees using Random Forest, which combines bagging with random feature subsampling to reduce overfitting and improve generalization
- Serves as a high-performance tree ensemble baseline that typically outperforms single decision trees and basic bagging through decorrelation of individual trees
- More robust than single decision trees while maintaining interpretability through feature importance metrics
Hyperparameter Tuning:
- Tuned via 5-fold stratified RandomizedSearchCV optimizing ROC-AUC with 60 parameter combinations
- Explored ensemble size (`n_estimators`), tree depth (`max_depth`), split criteria (`min_samples_split`, `min_samples_leaf`), feature sampling strategy (`max_features`), bootstrap sampling, and class weighting
- Selected configuration balances forest size, tree complexity, and feature randomization to maximize predictive power while controlling variance
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a strong ensemble benchmark to compare against single trees, bagging (without feature randomization), and boosting methods (sequential learning)
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis to evaluate the effectiveness of random feature subsampling
# ========== 7.8 RANDOM FOREST ==========
print("======= 7.8 RANDOM FOREST =======\n")
# Define the model
rf = RandomForestClassifier(random_state=RANDOM_STATE)
# Define hyperparameter distribution for RandomizedSearchCV
# Random Forest benefits from larger search space explored efficiently
param_distributions_rf = {
'n_estimators': [50, 100, 200, 300, 500],
'max_depth': [None, 10, 20, 30, 40, 50],
'min_samples_split': [2, 5, 10, 15],
'min_samples_leaf': [1, 2, 4, 5],
'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
'bootstrap': [True, False],
'class_weight': [None, 'balanced', 'balanced_subsample']
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# Use RandomizedSearchCV for efficiency
print("Training Random Forest with RandomizedSearchCV...")
random_search_rf = RandomizedSearchCV(
estimator=rf,
param_distributions=param_distributions_rf,
n_iter=60, # Sample 60 combinations
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_rf.fit(X_train_transformed, y_train)
# Best model
best_rf = random_search_rf.best_estimator_
best_params_rf = random_search_rf.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_rf.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_rf = best_rf.predict(X_train_transformed)
y_test_pred_rf = best_rf.predict(X_test_transformed)
y_train_proba_rf = best_rf.predict_proba(X_train_transformed)[:, 1]
y_test_proba_rf = best_rf.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_rf = accuracy_score(y_train, y_train_pred_rf)
test_acc_rf = accuracy_score(y_test, y_test_pred_rf)
# AUC
train_auc_rf = roc_auc_score(y_train, y_train_proba_rf)
test_auc_rf = roc_auc_score(y_test, y_test_proba_rf)
# Confusion matrices for Sensitivity/Specificity
cm_train_rf = confusion_matrix(y_train, y_train_pred_rf)
cm_test_rf = confusion_matrix(y_test, y_test_pred_rf)
# Sensitivity = recall for class 0 (non-default): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity = recall for class 1 (default): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_rf = cm_train_rf[0, 0] / (cm_train_rf[0, 0] + cm_train_rf[0, 1]) if (cm_train_rf[0, 0] + cm_train_rf[0, 1]) > 0 else 0
train_specificity_rf = cm_train_rf[1, 1] / (cm_train_rf[1, 0] + cm_train_rf[1, 1]) if (cm_train_rf[1, 0] + cm_train_rf[1, 1]) > 0 else 0
test_sensitivity_rf = cm_test_rf[0, 0] / (cm_test_rf[0, 0] + cm_test_rf[0, 1]) if (cm_test_rf[0, 0] + cm_test_rf[0, 1]) > 0 else 0
test_specificity_rf = cm_test_rf[1, 1] / (cm_test_rf[1, 0] + cm_test_rf[1, 1]) if (cm_test_rf[1, 0] + cm_test_rf[1, 1]) > 0 else 0
# Pack metrics
metrics_rf = {
'Model': 'Random Forest',
'Train Accuracy': train_acc_rf,
'Test Accuracy': test_acc_rf,
'Train AUC': train_auc_rf,
'Test AUC': test_auc_rf,
'Train Sensitivity': train_sensitivity_rf,
'Test Sensitivity': test_sensitivity_rf,
'Train Specificity': train_specificity_rf,
'Test Specificity': test_specificity_rf
}
# Display metrics
metrics_df_rf = pd.DataFrame([metrics_rf])
print("\nMetrics Summary:")
print(metrics_df_rf.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_rf, tpr_train_rf, _ = roc_curve(y_train, y_train_proba_rf)
fpr_test_rf, tpr_test_rf, _ = roc_curve(y_test, y_test_proba_rf)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_rf, tpr_train_rf, label=f'Train (AUC = {train_auc_rf:.3f})', linewidth=2)
plt.plot(fpr_test_rf, tpr_test_rf, label=f'Test (AUC = {test_auc_rf:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Random Forest', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_random_forest.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_random_forest.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_rf = 'models/random_forest_best.joblib'
dump(best_rf, model_path_rf)
print(f" Best model saved to: {model_path_rf}")
# Save best params
params_path_rf = 'artifacts/random_forest_best_params.json'
with open(params_path_rf, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {k: int(v) if isinstance(v, np.integer) else
float(v) if isinstance(v, np.floating) else v
for k, v in best_params_rf.items()}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_rf}")
# Save metrics
metrics_path_rf = 'artifacts/random_forest_metrics.json'
with open(metrics_path_rf, 'w') as f:
json.dump(metrics_rf, f, indent=2)
print(f" Metrics saved to: {metrics_path_rf}")
print("\n======= RANDOM FOREST COMPLETE =======\n")
# Return for potential downstream use
tuned_model_rf = best_rf
metrics_dict_rf = metrics_rf
======= 7.8 RANDOM FOREST =======
Training Random Forest with RandomizedSearchCV...
Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best Hyperparameters:
n_estimators: 300
min_samples_split: 15
min_samples_leaf: 1
max_features: 0.5
max_depth: 50
class_weight: balanced
bootstrap: True
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Random Forest 0.981451 0.931678 0.999068 0.936897 0.995755 0.98559 0.930347 0.739069
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_random_forest.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/random_forest_best.joblib
 Best params saved to: artifacts/random_forest_best_params.json
 Metrics saved to: artifacts/random_forest_metrics.json

======= RANDOM FOREST COMPLETE =======
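The confusion-matrix arithmetic repeated in each model section (class-0 recall reported as "sensitivity", class-1 recall as "specificity") could be factored into a single helper. A sketch; the function name `classwise_rates` is ours, not part of the notebook:

```python
from sklearn.metrics import confusion_matrix

def classwise_rates(y_true, y_pred):
    """Return (class-0 recall, class-1 recall) from a 2x2 confusion matrix.

    Matches the notebook's convention: 'sensitivity' = class-0 recall,
    'specificity' = class-1 recall. Guards against empty rows.
    """
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    row0, row1 = cm[0].sum(), cm[1].sum()
    sens = cm[0, 0] / row0 if row0 > 0 else 0.0
    spec = cm[1, 1] / row1 if row1 > 0 else 0.0
    return sens, spec

# Example: 3 of 4 class-0 samples and 1 of 2 class-1 samples correct
print(classwise_rates([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))  # (0.75, 0.5)
```

Each model block could then call this once per split instead of repeating the four guarded divisions.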
7.9 AdaBoost¶
Purpose & Approach:
- Implements a sequential boosting ensemble that combines weak learners (decision stumps) by iteratively reweighting misclassified samples to focus on hard-to-predict cases
- Serves as a bias reduction benchmark to compare sequential boosting against parallel ensemble methods (Bagging, Random Forest) and gradient-based boosting
- Uses adaptive boosting (SAMME algorithm) to build an additive model where each weak learner corrects errors from previous iterations
Hyperparameter Tuning:
- Tuned via 3-fold stratified RandomizedSearchCV (25 sampled configurations) optimizing ROC-AUC
- Explored ensemble size (n_estimators), learning rate for weight updates, and base estimator depth (1-4 levels)
- Selected configuration balances ensemble size with learning rate to prevent overfitting while maintaining strong sequential error correction
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a sequential boosting baseline to quantify whether iterative error-focused reweighting improves over parallel ensembles and gradient-based boosting methods
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis to assess whether adaptive reweighting causes training set memorization
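The reweighting mechanic described above can be illustrated with one toy SAMME round on six samples (our sketch, not the notebook's code): misclassified samples gain weight, so the next weak learner focuses on them.

```python
import numpy as np

w = np.full(6, 1 / 6)                    # uniform initial sample weights
correct = np.array([1, 1, 1, 1, 0, 0])   # the stump got the last two wrong
err = w[correct == 0].sum()              # weighted error = 2/6
alpha = np.log((1 - err) / err)          # learner weight; for K=2 classes
                                         # the SAMME log(K-1) term is 0
w = w * np.exp(alpha * (correct == 0))   # upweight only the mistakes
w = w / w.sum()                          # renormalize to a distribution

print(np.round(w, 3))  # [0.125 0.125 0.125 0.125 0.25 0.25]
```

The two misclassified samples end up with double the weight of the correctly classified ones, which is exactly the "focus on hard-to-predict cases" behavior the bullets describe.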
# ========== 7.9 ADABOOST ==========
print("======= 7.9 ADABOOST =======\n")
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define the base AdaBoost model
adaboost = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE),
algorithm='SAMME', # deprecated and a no-op in scikit-learn >= 1.6 (triggers a FutureWarning); kept for older versions
random_state=RANDOM_STATE
)
# Optimized hyperparameter search space
param_dist_adaboost = {
'n_estimators': randint(50, 300), # reduced range
'learning_rate': uniform(0.01, 1.0), # continuous sample
'estimator__max_depth': randint(1, 4) # stumps or shallow trees
}
# Faster cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
print("Training AdaBoost with RandomizedSearchCV...")
random_search_adaboost = RandomizedSearchCV(
estimator=adaboost,
param_distributions=param_dist_adaboost,
n_iter=25, # 25 random samples instead of 240 grid combos
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_adaboost.fit(X_train_transformed, y_train)
# Best model
best_adaboost = random_search_adaboost.best_estimator_
best_params_adaboost = random_search_adaboost.best_params_
print("\n Best Hyperparameters:")
for param, value in best_params_adaboost.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_adaboost = best_adaboost.predict(X_train_transformed)
y_test_pred_adaboost = best_adaboost.predict(X_test_transformed)
y_train_proba_adaboost = best_adaboost.predict_proba(X_train_transformed)[:, 1]
y_test_proba_adaboost = best_adaboost.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_adaboost = accuracy_score(y_train, y_train_pred_adaboost)
test_acc_adaboost = accuracy_score(y_test, y_test_pred_adaboost)
# AUC
train_auc_adaboost = roc_auc_score(y_train, y_train_proba_adaboost)
test_auc_adaboost = roc_auc_score(y_test, y_test_proba_adaboost)
# Confusion matrices
cm_train_adaboost = confusion_matrix(y_train, y_train_pred_adaboost)
cm_test_adaboost = confusion_matrix(y_test, y_test_pred_adaboost)
# Sensitivity & Specificity
train_sensitivity_adaboost = cm_train_adaboost[0, 0] / (cm_train_adaboost[0, 0] + cm_train_adaboost[0, 1]) if (cm_train_adaboost[0, 0] + cm_train_adaboost[0, 1]) > 0 else 0
train_specificity_adaboost = cm_train_adaboost[1, 1] / (cm_train_adaboost[1, 0] + cm_train_adaboost[1, 1]) if (cm_train_adaboost[1, 0] + cm_train_adaboost[1, 1]) > 0 else 0
test_sensitivity_adaboost = cm_test_adaboost[0, 0] / (cm_test_adaboost[0, 0] + cm_test_adaboost[0, 1]) if (cm_test_adaboost[0, 0] + cm_test_adaboost[0, 1]) > 0 else 0
test_specificity_adaboost = cm_test_adaboost[1, 1] / (cm_test_adaboost[1, 0] + cm_test_adaboost[1, 1]) if (cm_test_adaboost[1, 0] + cm_test_adaboost[1, 1]) > 0 else 0
# Pack metrics
metrics_adaboost = {
'Model': 'AdaBoost',
'Train Accuracy': train_acc_adaboost,
'Test Accuracy': test_acc_adaboost,
'Train AUC': train_auc_adaboost,
'Test AUC': test_auc_adaboost,
'Train Sensitivity': train_sensitivity_adaboost,
'Test Sensitivity': test_sensitivity_adaboost,
'Train Specificity': train_specificity_adaboost,
'Test Specificity': test_specificity_adaboost
}
# Display metrics
metrics_df_adaboost = pd.DataFrame([metrics_adaboost])
print("\nMetrics Summary:")
print(metrics_df_adaboost.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves once each (instead of repeated walrus-operator calls)
fpr_train_adaboost, tpr_train_adaboost, _ = roc_curve(y_train, y_train_proba_adaboost)
fpr_test_adaboost, tpr_test_adaboost, _ = roc_curve(y_test, y_test_proba_adaboost)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_adaboost, tpr_train_adaboost, label=f'Train (AUC = {train_auc_adaboost:.3f})', linewidth=2)
plt.plot(fpr_test_adaboost, tpr_test_adaboost, label=f'Test (AUC = {test_auc_adaboost:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — AdaBoost', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_adaboost.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_adaboost.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_adaboost = 'models/adaboost_best.joblib'
dump(best_adaboost, model_path_adaboost)
print(f" Best model saved to: {model_path_adaboost}")
# Save best params
params_path_adaboost = 'artifacts/adaboost_best_params.json'
with open(params_path_adaboost, 'w') as f:
serializable_params = {
k: (int(v) if isinstance(v, np.integer) else float(v) if isinstance(v, np.floating) else v)
for k, v in best_params_adaboost.items()
}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_adaboost}")
# Save metrics
metrics_path_adaboost = 'artifacts/adaboost_metrics.json'
with open(metrics_path_adaboost, 'w') as f:
json.dump(metrics_adaboost, f, indent=2)
print(f" Metrics saved to: {metrics_path_adaboost}")
print("\n======= ADABOOST COMPLETE =======\n")
# Return for potential downstream use
tuned_model_adaboost = best_adaboost
metrics_dict_adaboost = metrics_adaboost
======= 7.9 ADABOOST =======

Training AdaBoost with RandomizedSearchCV...
Fitting 3 folds for each of 25 candidates, totalling 75 fits
c:\Users\John\anaconda3\envs\mlproject\lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The parameter 'algorithm' is deprecated in 1.6 and has no effect. It will be removed in version 1.8. warnings.warn(
 Best Hyperparameters:
   estimator__max_depth: 3
   learning_rate: 0.6932635188254582
   n_estimators: 221

==================================================
COMPUTING METRICS
==================================================

Metrics Summary:
   Model  Train Accuracy  Test Accuracy  Train AUC  Test AUC  Train Sensitivity  Test Sensitivity  Train Specificity  Test Specificity
AdaBoost        0.912502       0.914713   0.928356  0.922013           0.974236          0.975128           0.691941          0.698872

==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_adaboost.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/adaboost_best.joblib
 Best params saved to: artifacts/adaboost_best_params.json
 Metrics saved to: artifacts/adaboost_metrics.json

======= ADABOOST COMPLETE =======
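The NumPy-to-native conversion repeated before every `json.dump` call could be replaced by a single fallback passed via the standard `default=` hook. A sketch; the converter name `to_native` is ours:

```python
import json
import numpy as np

def to_native(obj):
    """json.dump fallback: convert NumPy scalars/arrays to Python types."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj)}")

# Hypothetical params dict mimicking a RandomizedSearchCV result
params = {'n_estimators': np.int64(221), 'learning_rate': np.float64(0.69)}
print(json.dumps(params, default=to_native))
# {"n_estimators": 221, "learning_rate": 0.69}
```

With this in place, each save step reduces to `json.dump(best_params, f, indent=2, default=to_native)` and the per-model dict comprehensions can be dropped.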
7.10 Gradient Boosting¶
Purpose & Approach:
- Implements a gradient-based sequential boosting ensemble that builds an additive model by fitting shallow decision trees to the residual errors of previous iterations
- Serves as an advanced boosting benchmark to compare gradient-based optimization against adaptive reweighting (AdaBoost) and parallel ensemble methods (Random Forest, Bagging)
- Uses stagewise additive modeling where each tree minimizes a loss function's gradient, typically achieving superior performance on structured data
Hyperparameter Tuning:
- Tuned via 5-fold stratified RandomizedSearchCV optimizing ROC-AUC with 60 parameter combinations
- Explored ensemble size (n_estimators), learning rate for gradient descent steps, tree depth (max_depth), subsampling rate for stochastic gradient boosting, split criteria (min_samples_split, min_samples_leaf), and feature randomization (max_features)
- Selected configuration balances learning rate with ensemble size to prevent overfitting while maintaining strong sequential error correction through gradient optimization
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a gradient boosting benchmark to quantify whether gradient-based optimization outperforms adaptive reweighting (AdaBoost) and parallel aggregation methods (Bagging, Random Forest)
- Performance ranked in Section 10 leaderboard by test AUC with overfitting gap analysis to assess whether sequential gradient fitting causes training set memorization
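The residual-fitting mechanic described above can be shown in a few lines of toy regression (our sketch; `GradientBoostingClassifier` optimizes log-loss, but the stagewise additive idea is the same): each shallow tree fits what the current ensemble still gets wrong.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

pred = np.full_like(y, y.mean())   # F_0: constant initial model
lr = 0.1                           # learning rate (shrinkage)
for _ in range(100):
    resid = y - pred               # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, resid)
    pred += lr * tree.predict(X)   # F_m = F_{m-1} + lr * h_m

# Training MSE falls well below the variance of y, toward the noise floor
print(round(float(np.mean((y - pred) ** 2)), 4))
```

The learning-rate/ensemble-size trade-off in the bullets corresponds directly to `lr` and the loop count here: a smaller step needs more trees but typically generalizes better.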
# ========== 7.10 GRADIENT BOOSTING ==========
print("======= 7.10 GRADIENT BOOSTING =======\n")
# Define the model
gb = GradientBoostingClassifier(random_state=RANDOM_STATE)
# Define hyperparameter grid
# Key parameters: n_estimators, learning_rate, max_depth, subsample
param_grid_gb = {
'n_estimators': [50, 100, 200, 300, 500],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [3, 4, 5, 6, 7],
'subsample': [0.8, 0.9, 1.0],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
# Cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
# Use RandomizedSearchCV for efficiency (Gradient Boosting has many parameters)
print("Training Gradient Boosting with RandomizedSearchCV...")
random_search_gb = RandomizedSearchCV(
estimator=gb,
param_distributions=param_grid_gb,
n_iter=60, # Sample 60 combinations
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_gb.fit(X_train_transformed, y_train)
# Best model
best_gb = random_search_gb.best_estimator_
best_params_gb = random_search_gb.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_gb.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_gb = best_gb.predict(X_train_transformed)
y_test_pred_gb = best_gb.predict(X_test_transformed)
y_train_proba_gb = best_gb.predict_proba(X_train_transformed)[:, 1]
y_test_proba_gb = best_gb.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_gb = accuracy_score(y_train, y_train_pred_gb)
test_acc_gb = accuracy_score(y_test, y_test_pred_gb)
# AUC
train_auc_gb = roc_auc_score(y_train, y_train_proba_gb)
test_auc_gb = roc_auc_score(y_test, y_test_proba_gb)
# Confusion matrices for Sensitivity/Specificity
cm_train_gb = confusion_matrix(y_train, y_train_pred_gb)
cm_test_gb = confusion_matrix(y_test, y_test_pred_gb)
# Sensitivity (recall for class 0): cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity (recall for class 1): cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_gb = cm_train_gb[0, 0] / (cm_train_gb[0, 0] + cm_train_gb[0, 1]) if (cm_train_gb[0, 0] + cm_train_gb[0, 1]) > 0 else 0
train_specificity_gb = cm_train_gb[1, 1] / (cm_train_gb[1, 0] + cm_train_gb[1, 1]) if (cm_train_gb[1, 0] + cm_train_gb[1, 1]) > 0 else 0
test_sensitivity_gb = cm_test_gb[0, 0] / (cm_test_gb[0, 0] + cm_test_gb[0, 1]) if (cm_test_gb[0, 0] + cm_test_gb[0, 1]) > 0 else 0
test_specificity_gb = cm_test_gb[1, 1] / (cm_test_gb[1, 0] + cm_test_gb[1, 1]) if (cm_test_gb[1, 0] + cm_test_gb[1, 1]) > 0 else 0
# Pack metrics
metrics_gb = {
'Model': 'Gradient Boosting',
'Train Accuracy': train_acc_gb,
'Test Accuracy': test_acc_gb,
'Train AUC': train_auc_gb,
'Test AUC': test_auc_gb,
'Train Sensitivity': train_sensitivity_gb,
'Test Sensitivity': test_sensitivity_gb,
'Train Specificity': train_specificity_gb,
'Test Specificity': test_specificity_gb
}
# Display metrics
metrics_df_gb = pd.DataFrame([metrics_gb])
print("\nMetrics Summary:")
print(metrics_df_gb.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_gb, tpr_train_gb, _ = roc_curve(y_train, y_train_proba_gb)
fpr_test_gb, tpr_test_gb, _ = roc_curve(y_test, y_test_proba_gb)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_gb, tpr_train_gb, label=f'Train (AUC = {train_auc_gb:.3f})', linewidth=2)
plt.plot(fpr_test_gb, tpr_test_gb, label=f'Test (AUC = {test_auc_gb:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Gradient Boosting', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_gradient_boosting.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_gradient_boosting.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_gb = 'models/gradient_boosting_best.joblib'
dump(best_gb, model_path_gb)
print(f" Best model saved to: {model_path_gb}")
# Save best params
params_path_gb = 'artifacts/gradient_boosting_best_params.json'
with open(params_path_gb, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {k: int(v) if isinstance(v, np.integer) else
float(v) if isinstance(v, np.floating) else v
for k, v in best_params_gb.items()}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_gb}")
# Save metrics
metrics_path_gb = 'artifacts/gradient_boosting_metrics.json'
with open(metrics_path_gb, 'w') as f:
json.dump(metrics_gb, f, indent=2)
print(f" Metrics saved to: {metrics_path_gb}")
print("\n======= GRADIENT BOOSTING COMPLETE =======\n")
# Return for potential downstream use
tuned_model_gb = best_gb
metrics_dict_gb = metrics_gb
======= 7.10 GRADIENT BOOSTING =======
Training Gradient Boosting with RandomizedSearchCV...
Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best Hyperparameters:
subsample: 0.9
n_estimators: 500
min_samples_split: 5
min_samples_leaf: 4
max_features: None
max_depth: 4
learning_rate: 0.1
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Gradient Boosting 0.957311 0.937847 0.987415 0.951394 0.998322 0.992302 0.810792 0.7433
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_gradient_boosting.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/gradient_boosting_best.joblib
 Best params saved to: artifacts/gradient_boosting_best_params.json
 Metrics saved to: artifacts/gradient_boosting_metrics.json

======= GRADIENT BOOSTING COMPLETE =======
7.11 Support Vector Machine (Linear)¶
Purpose & Approach:
- Implements a maximum-margin linear classifier using LinearSVC, wrapped with calibrated probability estimates via CalibratedClassifierCV, for credit risk prediction
- Serves as a linear margin-based baseline to compare against kernel-based SVMs, probabilistic models, and ensemble methods by finding the hyperplane that maximizes the margin between classes
Hyperparameter Tuning:
- Tuned via 5-fold stratified GridSearchCV optimizing ROC-AUC
- Explored regularization strength (C) and class weighting strategies (None, balanced) to handle class imbalance
- Probability calibration applied post-training using internal 5-fold cross-validation to enable probabilistic predictions
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a fast linear SVM benchmark to assess whether linear decision boundaries are sufficient or if nonlinear kernels (RBF SVM) provide meaningful performance gains
- Performance ranked in Section 10 leaderboard by test AUC with generalization gap analysis to evaluate margin-based classification effectiveness
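Why the calibration wrapper is needed can be seen on synthetic data (our sketch, not the notebook's pipeline): a raw LinearSVC exposes only margin scores via `decision_function`, while the calibrated wrapper maps them to probabilities usable for ROC-AUC.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=500, random_state=0)

# Raw LinearSVC: margins only, no predict_proba
raw = LinearSVC(max_iter=5000).fit(X, y)
print(hasattr(raw, 'predict_proba'))   # False

# Calibrated wrapper: fits a probability mapping via internal CV
cal = CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=5).fit(X, y)
proba = cal.predict_proba(X)[:, 1]
print(proba.min() >= 0 and proba.max() <= 1)  # True — valid probabilities
```

Note that grid-searching over the wrapper (as the notebook does with the `estimator__` prefix) tunes the underlying SVM while refitting the calibration mapping at every candidate.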
# ========== 7.11 SUPPORT VECTOR MACHINE (LINEAR) ==========
print("======= 7.11 SUPPORT VECTOR MACHINE (LINEAR) =======\n")
# ----------------------------------------------------------
# FAST LINEAR SVM IMPLEMENTATION (LinearSVC + Calibration)
# ----------------------------------------------------------
# Base linear SVM (fast, but no probability)
base_linear_svm = LinearSVC(
C=1.0,
class_weight=None,
random_state=RANDOM_STATE,
max_iter=5000
)
# Wrap with probability calibration so the model exposes predict_proba for ROC-AUC
svm_linear = CalibratedClassifierCV(
estimator=base_linear_svm,
cv=5 # internal CV for probability calibration
)
# NOTE: Hyperparameters apply to the underlying LinearSVC ("estimator__")
param_grid_svm_linear = {
'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100],
'estimator__class_weight': [None, 'balanced']
}
# Cross-validation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
print("Training Linear SVM with GridSearchCV...")
grid_search_svm_linear = GridSearchCV(
estimator=svm_linear,
param_grid=param_grid_svm_linear,
cv=cv_strategy,
scoring='roc_auc',
n_jobs=-1,
verbose=1
)
grid_search_svm_linear.fit(X_train_transformed, y_train)
# Best model & params
best_svm_linear = grid_search_svm_linear.best_estimator_
best_params_svm_linear = grid_search_svm_linear.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_svm_linear.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_svm_linear = best_svm_linear.predict(X_train_transformed)
y_test_pred_svm_linear = best_svm_linear.predict(X_test_transformed)
# Probabilities (thanks to CalibratedClassifierCV)
y_train_proba_svm_linear = best_svm_linear.predict_proba(X_train_transformed)[:, 1]
y_test_proba_svm_linear = best_svm_linear.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_svm_linear = accuracy_score(y_train, y_train_pred_svm_linear)
test_acc_svm_linear = accuracy_score(y_test, y_test_pred_svm_linear)
# AUC
train_auc_svm_linear = roc_auc_score(y_train, y_train_proba_svm_linear)
test_auc_svm_linear = roc_auc_score(y_test, y_test_proba_svm_linear)
# Confusion matrices
cm_train_svm_linear = confusion_matrix(y_train, y_train_pred_svm_linear)
cm_test_svm_linear = confusion_matrix(y_test, y_test_pred_svm_linear)
# Sensitivity (Recall for class 0)
train_sensitivity_svm_linear = cm_train_svm_linear[0, 0] / (cm_train_svm_linear[0, 0] + cm_train_svm_linear[0, 1])
test_sensitivity_svm_linear = cm_test_svm_linear[0, 0] / (cm_test_svm_linear[0, 0] + cm_test_svm_linear[0, 1])
# Specificity (Recall for class 1)
train_specificity_svm_linear = cm_train_svm_linear[1, 1] / (cm_train_svm_linear[1, 0] + cm_train_svm_linear[1, 1])
test_specificity_svm_linear = cm_test_svm_linear[1, 1] / (cm_test_svm_linear[1, 0] + cm_test_svm_linear[1, 1])
# Metrics dictionary
metrics_svm_linear = {
'Model': 'SVM (Linear)',
'Train Accuracy': train_acc_svm_linear,
'Test Accuracy': test_acc_svm_linear,
'Train AUC': train_auc_svm_linear,
'Test AUC': test_auc_svm_linear,
'Train Sensitivity': train_sensitivity_svm_linear,
'Test Sensitivity': test_sensitivity_svm_linear,
'Train Specificity': train_specificity_svm_linear,
'Test Specificity': test_specificity_svm_linear
}
# Display metrics
metrics_df_svm_linear = pd.DataFrame([metrics_svm_linear])
print("\nMetrics Summary:")
print(metrics_df_svm_linear.to_string(index=False))
# ========== ROC CURVES ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
fpr_train_svm_linear, tpr_train_svm_linear, _ = roc_curve(y_train, y_train_proba_svm_linear)
fpr_test_svm_linear, tpr_test_svm_linear, _ = roc_curve(y_test, y_test_proba_svm_linear)
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_svm_linear, tpr_train_svm_linear, label=f'Train (AUC = {train_auc_svm_linear:.3f})', linewidth=2)
plt.plot(fpr_test_svm_linear, tpr_test_svm_linear, label=f'Test (AUC = {test_auc_svm_linear:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — SVM (Linear)', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_svm_linear.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_svm_linear.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
model_path_svm_linear = 'models/svm_linear_best.joblib'
dump(best_svm_linear, model_path_svm_linear)
print(f" Best model saved to: {model_path_svm_linear}")
params_path_svm_linear = 'artifacts/svm_linear_best_params.json'
with open(params_path_svm_linear, 'w') as f:
serializable_params = {
k: (int(v) if isinstance(v, np.integer) else
float(v) if isinstance(v, np.floating) else v)
for k, v in best_params_svm_linear.items()
}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_svm_linear}")
metrics_path_svm_linear = 'artifacts/svm_linear_metrics.json'
with open(metrics_path_svm_linear, 'w') as f:
json.dump(metrics_svm_linear, f, indent=2)
print(f" Metrics saved to: {metrics_path_svm_linear}")
print("\n======= SVM (LINEAR) COMPLETE =======\n")
# Return for downstream use (same as before)
tuned_model_svm_linear = best_svm_linear
metrics_dict_svm_linear = metrics_svm_linear
======= 7.11 SUPPORT VECTOR MACHINE (LINEAR) =======
Training Linear SVM with GridSearchCV...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters:
estimator__C: 100
estimator__class_weight: balanced
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
SVM (Linear) 0.872281 0.870759 0.892577 0.890186 0.949904 0.947493 0.594957 0.596615
==================================================
PLOTTING ROC CURVES
==================================================
 ROC curve saved to: Output/roc_curve_svm_linear.png

==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/svm_linear_best.joblib
 Best params saved to: artifacts/svm_linear_best_params.json
 Metrics saved to: artifacts/svm_linear_metrics.json

======= SVM (LINEAR) COMPLETE =======
7.12 Support Vector Machine (RBF)¶
Purpose & Approach:
- Implements a nonlinear kernel-based classifier using RBF (Radial Basis Function) SVM to capture complex decision boundaries that linear models cannot represent
- Serves as a high-flexibility margin-based benchmark to assess whether nonlinear transformations improve over linear SVM and probabilistic models for credit risk prediction
- Uses kernel trick to implicitly map features to high-dimensional space where nonlinear patterns become linearly separable
Hyperparameter Tuning:
- Tuned via 3-fold stratified RandomizedSearchCV on a 40% stratified subsample of training data to reduce computational cost while maintaining class balance
- Explored regularization strength (C), kernel width (gamma), and class weighting strategies across 15 sampled parameter combinations
- Final model retrained on the full training set with probability=True and the best hyperparameters to enable calibrated probability estimates
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a nonlinear kernel benchmark to quantify performance gains over linear SVM and assess whether kernel-based classification justifies increased computational cost
- Performance ranked in Section 10 leaderboard by test AUC with generalization gap analysis to evaluate whether RBF kernel overfits compared to simpler models
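The kernel at the heart of this model is just a similarity function, k(x, z) = exp(-gamma * ||x - z||^2); a minimal check of the formula against scikit-learn's implementation (our sketch, with an arbitrary gamma):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])
gamma = 0.5

# Manual RBF kernel: exp(-gamma * squared Euclidean distance)
manual = np.exp(-gamma * np.sum((x - z) ** 2))   # exp(-0.5 * 2) = exp(-1)
print(np.allclose(manual, rbf_kernel(x, z, gamma=gamma)))  # True
```

The tuned gamma controls how fast this similarity decays with distance: larger gamma means more local, wigglier decision boundaries (higher overfitting risk); the 'scale' default sets gamma = 1 / (n_features * X.var()).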
# ========== 7.12 SUPPORT VECTOR MACHINE (RBF) ==========
print("======= 7.12 SUPPORT VECTOR MACHINE (RBF) =======\n")
# ============================================================
# OPTIMIZATION STRATEGY
# 1. Subsample the training set for tuning (40%).
# 2. Disable probability=True during tuning -> 3–5× faster.
# 3. Reduced, high-value search grid -> 4× faster.
# 4. Fewer CV folds (cv=3) -> 40% faster.
# 5. n_iter=15 instead of 30 -> 2× faster.
# ============================================================
# ----- 1. Subsample training data for faster hyperparameter tuning -----
X_sub, _, y_sub, _ = train_test_split(
X_train_transformed,
y_train,
train_size=0.40, # adjustable (0.3–0.5 recommended)
stratify=y_train,
random_state=RANDOM_STATE
)
print(f"Using subsample for tuning: {X_sub.shape[0]} rows")
# ----- 2. Define the model WITHOUT probability to avoid slow internal CV -----
svm_rbf_tune = SVC(kernel='rbf', probability=False, random_state=RANDOM_STATE)
# ----- 3. Optimized hyperparameter search grid -----
param_distributions_svm_rbf = {
'C': [0.1, 1, 10, 50, 100],
'gamma': ['scale', 0.001, 0.01, 0.1],
'class_weight': [None, 'balanced']
}
# ----- 4. Faster CV -----
cv_strategy_fast = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
# ----- 5. RandomizedSearchCV with fewer iterations -----
print("\nTraining FAST RBF SVM with RandomizedSearchCV...")
random_search_svm_rbf = RandomizedSearchCV(
estimator=svm_rbf_tune,
param_distributions=param_distributions_svm_rbf,
n_iter=15, # cut from 30 → 15 (much faster)
cv=cv_strategy_fast,
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_svm_rbf.fit(X_sub, y_sub)
best_params_svm_rbf = random_search_svm_rbf.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_svm_rbf.items():
print(f" {param}: {value}")
# ============================================================
# FINAL TRAINING ON FULL DATA WITH probability=True
# ============================================================
print("\nFitting final RBF SVM model on FULL training data with probability=True...")
best_svm_rbf = SVC(
kernel='rbf',
C=best_params_svm_rbf['C'],
gamma=best_params_svm_rbf['gamma'],
class_weight=best_params_svm_rbf['class_weight'],
probability=True, # now safe, only once
random_state=RANDOM_STATE
)
best_svm_rbf.fit(X_train_transformed, y_train)
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_svm_rbf = best_svm_rbf.predict(X_train_transformed)
y_test_pred_svm_rbf = best_svm_rbf.predict(X_test_transformed)
# Probabilities
y_train_proba_svm_rbf = best_svm_rbf.predict_proba(X_train_transformed)[:, 1]
y_test_proba_svm_rbf = best_svm_rbf.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_svm_rbf = accuracy_score(y_train, y_train_pred_svm_rbf)
test_acc_svm_rbf = accuracy_score(y_test, y_test_pred_svm_rbf)
# AUC
train_auc_svm_rbf = roc_auc_score(y_train, y_train_proba_svm_rbf)
test_auc_svm_rbf = roc_auc_score(y_test, y_test_proba_svm_rbf)
# Confusion matrices for Sensitivity/Specificity
cm_train_svm_rbf = confusion_matrix(y_train, y_train_pred_svm_rbf)
cm_test_svm_rbf = confusion_matrix(y_test, y_test_pred_svm_rbf)
train_sensitivity_svm_rbf = cm_train_svm_rbf[0, 0] / (cm_train_svm_rbf[0, 0] + cm_train_svm_rbf[0, 1])
train_specificity_svm_rbf = cm_train_svm_rbf[1, 1] / (cm_train_svm_rbf[1, 0] + cm_train_svm_rbf[1, 1])
test_sensitivity_svm_rbf = cm_test_svm_rbf[0, 0] / (cm_test_svm_rbf[0, 0] + cm_test_svm_rbf[0, 1])
test_specificity_svm_rbf = cm_test_svm_rbf[1, 1] / (cm_test_svm_rbf[1, 0] + cm_test_svm_rbf[1, 1])
# Pack metrics
metrics_svm_rbf = {
'Model': 'SVM (RBF)',
'Train Accuracy': train_acc_svm_rbf,
'Test Accuracy': test_acc_svm_rbf,
'Train AUC': train_auc_svm_rbf,
'Test AUC': test_auc_svm_rbf,
'Train Sensitivity': train_sensitivity_svm_rbf,
'Test Sensitivity': test_sensitivity_svm_rbf,
'Train Specificity': train_specificity_svm_rbf,
'Test Specificity': test_specificity_svm_rbf
}
# Display metrics
metrics_df_svm_rbf = pd.DataFrame([metrics_svm_rbf])
print("\nMetrics Summary:")
print(metrics_df_svm_rbf.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
fpr_train_svm_rbf, tpr_train_svm_rbf, _ = roc_curve(y_train, y_train_proba_svm_rbf)
fpr_test_svm_rbf, tpr_test_svm_rbf, _ = roc_curve(y_test, y_test_proba_svm_rbf)
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_svm_rbf, tpr_train_svm_rbf, label=f'Train (AUC = {train_auc_svm_rbf:.3f})', linewidth=2)
plt.plot(fpr_test_svm_rbf, tpr_test_svm_rbf, label=f'Test (AUC = {test_auc_svm_rbf:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — SVM (RBF)', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_svm_rbf.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_svm_rbf.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model (joblib's dump; imported here in case earlier cells were not run)
from joblib import dump
model_path_svm_rbf = 'models/svm_rbf_best.joblib'
dump(best_svm_rbf, model_path_svm_rbf)
print(f" Best model saved to: {model_path_svm_rbf}")
# Save best params
params_path_svm_rbf = 'artifacts/svm_rbf_best_params.json'
with open(params_path_svm_rbf, 'w') as f:
serializable_params = {
k: (int(v) if isinstance(v, np.integer)
else float(v) if isinstance(v, np.floating)
else v)
for k, v in best_params_svm_rbf.items()
}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_svm_rbf}")
# Save metrics
metrics_path_svm_rbf = 'artifacts/svm_rbf_metrics.json'
with open(metrics_path_svm_rbf, 'w') as f:
json.dump(metrics_svm_rbf, f, indent=2)
print(f" Metrics saved to: {metrics_path_svm_rbf}")
print("\n======= SVM (RBF) COMPLETE =======\n")
# Expose results under consistent names for downstream sections
tuned_model_svm_rbf = best_svm_rbf
metrics_dict_svm_rbf = metrics_svm_rbf
======= 7.12 SUPPORT VECTOR MACHINE (RBF) =======
Using subsample for tuning: 10372 rows
Training FAST RBF SVM with RandomizedSearchCV...
Fitting 3 folds for each of 15 candidates, totalling 45 fits
Best Hyperparameters:
gamma: scale
class_weight: balanced
C: 1
Fitting final RBF SVM model on FULL training data with probability=True...
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
SVM (RBF) 0.887475 0.876311 0.932581 0.908564 0.914269 0.907027 0.791747 0.766573
==================================================
PLOTTING ROC CURVES
==================================================
ROC curve saved to: Output/roc_curve_svm_rbf.png
==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/svm_rbf_best.joblib
 Best params saved to: artifacts/svm_rbf_best_params.json
 Metrics saved to: artifacts/svm_rbf_metrics.json

======= SVM (RBF) COMPLETE =======
7.13 Neural Network (MLP)¶
Purpose & Approach:
- Implements a feedforward neural network (Multi-Layer Perceptron) using backpropagation to learn complex nonlinear patterns in credit risk data
- Serves as a deep learning baseline to assess whether neural architectures can capture feature interactions and nonlinearities that classical models miss
- Uses early stopping with validation holdout to prevent overfitting during iterative gradient-based training
Hyperparameter Tuning:
- Tuned via 3-fold stratified RandomizedSearchCV optimizing ROC-AUC with 15 parameter combinations
- Explored network architecture (hidden_layer_sizes), L2 regularization strength (alpha), and learning rate (learning_rate_init) for the Adam optimizer
- Reduced the search space for computational efficiency: ReLU activation only, Adam solver, and a limited set of architecture candidates
Evaluation Metrics:
- Reported accuracy, ROC-AUC, sensitivity (Class 0), and specificity (Class 1) on both train and test sets
- Generated ROC curve and confusion matrix for interpretability
Model Comparison:
- Serves as a neural network benchmark to quantify whether deep learning provides meaningful gains over tree ensembles, SVMs, and linear models for credit risk prediction
- Performance is ranked in the Section 9 leaderboard by test AUC, with an overfitting-gap analysis to check whether iterative training causes memorization
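The sensitivity/specificity convention above (sensitivity = recall of Class 0, specificity = recall of Class 1) can be made concrete with a toy worked example; the labels here are invented for illustration, not project data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = no default, 1 = default (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)
# Row i = true class i, column j = predicted class j:
# cm[0, 0] = class-0 correctly kept, cm[0, 1] = class-0 flagged as default
# cm[1, 0] = missed defaults,        cm[1, 1] = defaults caught

sensitivity = cm[0, 0] / cm[0].sum()  # recall of class 0 -> 3/4 = 0.75
specificity = cm[1, 1] / cm[1].sum()  # recall of class 1 -> 3/4 = 0.75
print(sensitivity, specificity)
```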
# ========== 7.13 NEURAL NETWORK (MLP) ==========
print("======= 7.13 NEURAL NETWORK (MLP) =======\n")
# Define the model (import here in case earlier cells were not run)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=RANDOM_STATE, max_iter=500, early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=10)
# Define REDUCED hyperparameter distribution for RandomizedSearchCV
# Focused on most impactful parameters to reduce runtime
param_distributions_mlp = {
'hidden_layer_sizes': [(50,), (100,), (100, 50)], # Reduced from 5 to 3 options
'activation': ['relu'], # Keep only relu (usually best performer)
'alpha': [0.0001, 0.001, 0.01], # Reduced from 4 to 3 options
'learning_rate_init': [0.001, 0.01], # Reduced from 3 to 2 options
'solver': ['adam'] # Remove 'sgd' - adam is usually better and faster
}
# Cross-validation strategy - reduced folds for speed
cv_strategy = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
# Use RandomizedSearchCV for efficiency
print("Training Neural Network (MLP) with RandomizedSearchCV...")
random_search_mlp = RandomizedSearchCV(
estimator=mlp,
param_distributions=param_distributions_mlp,
n_iter=15, # Reduced from 40 to 15 iterations
cv=cv_strategy, # 3-fold instead of 5-fold
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=RANDOM_STATE
)
random_search_mlp.fit(X_train_transformed, y_train)
# Best model
best_mlp = random_search_mlp.best_estimator_
best_params_mlp = random_search_mlp.best_params_
print(f"\n Best Hyperparameters:")
for param, value in best_params_mlp.items():
print(f" {param}: {value}")
# ========== COMPUTE METRICS ==========
print("\n" + "="*50)
print("COMPUTING METRICS")
print("="*50)
# Predictions
y_train_pred_mlp = best_mlp.predict(X_train_transformed)
y_test_pred_mlp = best_mlp.predict(X_test_transformed)
y_train_proba_mlp = best_mlp.predict_proba(X_train_transformed)[:, 1]
y_test_proba_mlp = best_mlp.predict_proba(X_test_transformed)[:, 1]
# Accuracy
train_acc_mlp = accuracy_score(y_train, y_train_pred_mlp)
test_acc_mlp = accuracy_score(y_test, y_test_pred_mlp)
# AUC
train_auc_mlp = roc_auc_score(y_train, y_train_proba_mlp)
test_auc_mlp = roc_auc_score(y_test, y_test_proba_mlp)
# Confusion matrices for Sensitivity/Specificity
cm_train_mlp = confusion_matrix(y_train, y_train_pred_mlp)
cm_test_mlp = confusion_matrix(y_test, y_test_pred_mlp)
# Sensitivity = recall of class 0: cm[0,0] / (cm[0,0] + cm[0,1])
# Specificity = recall of class 1: cm[1,1] / (cm[1,0] + cm[1,1])
train_sensitivity_mlp = cm_train_mlp[0, 0] / (cm_train_mlp[0, 0] + cm_train_mlp[0, 1]) if (cm_train_mlp[0, 0] + cm_train_mlp[0, 1]) > 0 else 0
train_specificity_mlp = cm_train_mlp[1, 1] / (cm_train_mlp[1, 0] + cm_train_mlp[1, 1]) if (cm_train_mlp[1, 0] + cm_train_mlp[1, 1]) > 0 else 0
test_sensitivity_mlp = cm_test_mlp[0, 0] / (cm_test_mlp[0, 0] + cm_test_mlp[0, 1]) if (cm_test_mlp[0, 0] + cm_test_mlp[0, 1]) > 0 else 0
test_specificity_mlp = cm_test_mlp[1, 1] / (cm_test_mlp[1, 0] + cm_test_mlp[1, 1]) if (cm_test_mlp[1, 0] + cm_test_mlp[1, 1]) > 0 else 0
# Pack metrics
metrics_mlp = {
'Model': 'Neural Network (MLP)',
'Train Accuracy': train_acc_mlp,
'Test Accuracy': test_acc_mlp,
'Train AUC': train_auc_mlp,
'Test AUC': test_auc_mlp,
'Train Sensitivity': train_sensitivity_mlp,
'Test Sensitivity': test_sensitivity_mlp,
'Train Specificity': train_specificity_mlp,
'Test Specificity': test_specificity_mlp
}
# Display metrics
metrics_df_mlp = pd.DataFrame([metrics_mlp])
print("\nMetrics Summary:")
print(metrics_df_mlp.to_string(index=False))
# ========== ROC CURVE ==========
print("\n" + "="*50)
print("PLOTTING ROC CURVES")
print("="*50)
# Compute ROC curves
fpr_train_mlp, tpr_train_mlp, _ = roc_curve(y_train, y_train_proba_mlp)
fpr_test_mlp, tpr_test_mlp, _ = roc_curve(y_test, y_test_proba_mlp)
# Plot
plt.figure(figsize=(10, 7))
plt.plot(fpr_train_mlp, tpr_train_mlp, label=f'Train (AUC = {train_auc_mlp:.3f})', linewidth=2)
plt.plot(fpr_test_mlp, tpr_test_mlp, label=f'Test (AUC = {test_auc_mlp:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curve — Neural Network (MLP)', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('Output/roc_curve_mlp.png', dpi=300, bbox_inches='tight')
plt.show()
print(" ROC curve saved to: Output/roc_curve_mlp.png")
# ========== SAVE ARTIFACTS ==========
print("\n" + "="*50)
print("SAVING ARTIFACTS")
print("="*50)
# Save best model
model_path_mlp = 'models/mlp_best.joblib'
dump(best_mlp, model_path_mlp)
print(f" Best model saved to: {model_path_mlp}")
# Save best params
params_path_mlp = 'artifacts/mlp_best_params.json'
with open(params_path_mlp, 'w') as f:
# Convert numpy types to native Python types for JSON serialization
serializable_params = {
        k: (int(v) if isinstance(v, np.integer) else
            float(v) if isinstance(v, np.floating) else
            str(v) if isinstance(v, tuple) else v)
for k, v in best_params_mlp.items()
}
json.dump(serializable_params, f, indent=2)
print(f" Best params saved to: {params_path_mlp}")
# Save metrics
metrics_path_mlp = 'artifacts/mlp_metrics.json'
with open(metrics_path_mlp, 'w') as f:
json.dump(metrics_mlp, f, indent=2)
print(f" Metrics saved to: {metrics_path_mlp}")
print("\n======= NEURAL NETWORK (MLP) COMPLETE =======\n")
# Expose results under consistent names for downstream sections
tuned_model_mlp = best_mlp
metrics_dict_mlp = metrics_mlp
======= 7.13 NEURAL NETWORK (MLP) =======
Training Neural Network (MLP) with RandomizedSearchCV...
Fitting 3 folds for each of 15 candidates, totalling 45 fits
Best Hyperparameters:
solver: adam
learning_rate_init: 0.01
hidden_layer_sizes: (100, 50)
alpha: 0.001
activation: relu
==================================================
COMPUTING METRICS
==================================================
Metrics Summary:
Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity
Neural Network (MLP) 0.933094 0.925046 0.940987 0.914789 0.994373 0.988551 0.71416 0.698166
==================================================
PLOTTING ROC CURVES
==================================================
ROC curve saved to: Output/roc_curve_mlp.png
==================================================
SAVING ARTIFACTS
==================================================
 Best model saved to: models/mlp_best.joblib
 Best params saved to: artifacts/mlp_best_params.json
 Metrics saved to: artifacts/mlp_metrics.json

======= NEURAL NETWORK (MLP) COMPLETE =======
8. Cross-Validation Protocol¶
This project uses stratified k-fold cross-validation so that every fold preserves the class imbalance of the loan_status target. Cross-validation was integrated directly into model tuning through GridSearchCV or RandomizedSearchCV. Most models (Logistic Regression, LDA/QDA, Naive Bayes, KNN, Decision Tree, Bagging, Random Forest, AdaBoost, Gradient Boosting, and the linear SVM) were tuned with 5-fold stratified CV, using ROC-AUC as the primary scoring metric. Because hyperparameters are chosen entirely inside the CV loop, evaluation stays consistent and no test-set information leaks into model selection.
Two models required a more economical strategy. The RBF SVM was tuned with 3-fold stratified CV over a reduced parameter grid on a 40% stratified subsample of the training data; after tuning, the best configuration was refit with probability=True on the full training set to produce final probability estimates. The MLP was likewise tuned with 3-fold stratified RandomizedSearchCV (15 candidates) to keep its iterative training tractable.
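A minimal sketch of the subsample-then-refit strategy on synthetic stand-in data; the real hyperparameter search over C and gamma is elided, and the chosen values below are only placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the transformed training matrix (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.22).astype(int)  # ~22% minority, like loan_status

# 40% stratified subsample for the expensive hyperparameter search
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.40, stratify=y, random_state=42
)
assert abs(y_sub.mean() - y.mean()) < 0.01  # class ratio preserved

# Tune on the subsample (search elided), then refit the chosen
# configuration on the FULL training data with probability=True
best = SVC(kernel="rbf", C=1, gamma="scale",
           class_weight="balanced", probability=True)
best.fit(X, y)
proba = best.predict_proba(X)[:, 1]  # final probability estimates
```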
Overall, this cross-validation framework yields stable, comparable model evaluations and reliably guides hyperparameter selection across all classifiers used in the analysis.
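The stratification guarantee described above can be verified with a toy sketch (synthetic labels mimicking the roughly 78/22 class split, not project data): each validation fold retains close to the overall minority rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target mimicking the ~78/22 loan_status split (not project data)
y = np.array([0] * 78 + [1] * 22)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold keeps a minority rate near 22%
    print(f"Fold {fold}: minority rate = {y[val_idx].mean():.2f}")
```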
9. Model Comparison & Selection¶
This section aggregates the performance metrics from all trained models, ranks them by test-set AUC, and visualizes their relative performance. A consolidated leaderboard is constructed from each model’s evaluation results, including accuracy, ROC-AUC, sensitivity, specificity, and overfitting gaps. Several comparison plots are generated to highlight differences across models: overall AUC rankings, train–test performance contrasts, sensitivity–specificity trade-offs, overfitting patterns, and accuracy comparisons.
The best-performing model is selected based on test AUC, and its key performance metrics are reported. Model artifacts—including the leaderboard, visual comparison dashboard, and selection metadata—are saved for reproducibility. This process provides a clear, data-driven method for comparing all candidate classifiers and identifying the strongest model for credit risk prediction.
# ========== 9. MODEL COMPARISON & SELECTION ==========
print("======= 9. MODEL COMPARISON & SELECTION =======\n")
# ========== 1. AGGREGATE ALL MODEL METRICS ==========
print("1. AGGREGATING MODEL METRICS")
print("="*50)
# Collect all metrics dictionaries
all_metrics = [
metrics_dict_dummy,
metrics_dict_lr,
metrics_dict_lda,
metrics_dict_qda,
metrics_dict_gnb,
metrics_dict_knn,
metrics_dict_dt,
metrics_dict_bagging,
metrics_dict_rf,
metrics_dict_adaboost,
metrics_dict_gb,
metrics_dict_svm_linear,
metrics_dict_svm_rbf,
metrics_dict_mlp
]
# Create comprehensive leaderboard
leaderboard = pd.DataFrame(all_metrics)
# Sort by Test AUC (primary metric) descending
leaderboard = leaderboard.sort_values('Test AUC', ascending=False).reset_index(drop=True)
# Add rank column
leaderboard.insert(0, 'Rank', range(1, len(leaderboard) + 1))
# Calculate overfitting metrics (Train - Test gap)
leaderboard['AUC_Gap'] = leaderboard['Train AUC'] - leaderboard['Test AUC']
leaderboard['Acc_Gap'] = leaderboard['Train Accuracy'] - leaderboard['Test Accuracy']
print("\nMODEL LEADERBOARD (Ranked by Test AUC)")
print("="*100)
print(leaderboard.to_string(index=False))
# ========== 2. SAVE LEADERBOARD ==========
print("\n2. SAVING LEADERBOARD")
print("="*50)
leaderboard_path = 'artifacts/model_leaderboard.csv'
leaderboard.to_csv(leaderboard_path, index=False)
print(f"Leaderboard saved to: {leaderboard_path}")
# ========== 3. VISUALIZE MODEL COMPARISON ==========
print("\n3. VISUALIZING MODEL COMPARISONS")
print("="*50)
# Create comparison plots
fig = plt.figure(figsize=(18, 14)) # slightly taller
gs = fig.add_gridspec(3, 2, hspace=0.6, wspace=0.5)
# Plot 1: Test AUC Comparison (Bar Chart)
ax1 = fig.add_subplot(gs[0, :])
colors = ['red' if model == 'DummyClassifier (Baseline)' else
'green' if i == 0 else 'steelblue'
for i, model in enumerate(leaderboard['Model'])]
bars = ax1.barh(leaderboard['Model'], leaderboard['Test AUC'], color=colors, alpha=0.7, edgecolor='black')
ax1.set_xlabel('Test AUC', fontsize=12, fontweight='bold')
ax1.set_title('Model Comparison: Test AUC (Primary Metric)', fontsize=14, fontweight='bold')
ax1.axvline(x=0.5, color='red', linestyle='--', linewidth=1, alpha=0.5)
ax1.grid(axis='x', alpha=0.3)
# Add value labels
for i, (bar, val) in enumerate(zip(bars, leaderboard['Test AUC'])):
ax1.text(val + 0.01, bar.get_y() + bar.get_height()/2, f'{val:.3f}',
va='center', fontsize=9, fontweight='bold')
# Plot 2: Train vs Test AUC
ax2 = fig.add_subplot(gs[1, 0])
x_pos = np.arange(len(leaderboard))
width = 0.35
ax2.bar(x_pos - width/2, leaderboard['Train AUC'], width, label='Train AUC', alpha=0.8)
ax2.bar(x_pos + width/2, leaderboard['Test AUC'], width, label='Test AUC', alpha=0.8)
ax2.set_ylabel('AUC', fontsize=12, fontweight='bold')
ax2.set_title('Train vs Test AUC', fontsize=12, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(leaderboard['Model'], rotation=45, ha='right', fontsize=9)
ax2.legend()
ax2.grid(axis='y', alpha=0.3)
ax2.axhline(y=0.5, color='red', linestyle='--', linewidth=1, alpha=0.5)
# Plot 3: Sensitivity vs Specificity
ax3 = fig.add_subplot(gs[1, 1])
scatter = ax3.scatter(
leaderboard['Test Sensitivity'],
leaderboard['Test Specificity'],
s=leaderboard['Test AUC']*300,
alpha=0.6,
c=range(len(leaderboard)),
cmap='viridis',
edgecolors='black',
linewidth=1
)
ax3.set_xlabel('Test Sensitivity (Class 0)', fontsize=11, fontweight='bold')
ax3.set_ylabel('Test Specificity (Class 1)', fontsize=11, fontweight='bold')
ax3.set_title('Sensitivity vs Specificity (size = AUC)', fontsize=12, fontweight='bold')
ax3.grid(alpha=0.3)
for i, model in enumerate(leaderboard['Model']):
ax3.annotate(model[:10],
(leaderboard['Test Sensitivity'].iloc[i], leaderboard['Test Specificity'].iloc[i]),
fontsize=7, alpha=0.7)
# Plot 4: Overfitting Gap (AUC Gap)
ax4 = fig.add_subplot(gs[2, 0])
ax4.barh(leaderboard['Model'], leaderboard['AUC_Gap'],
color=['red' if gap > 0.1 else 'orange' if gap > 0.05 else 'green'
for gap in leaderboard['AUC_Gap']],
alpha=0.7, edgecolor='black')
ax4.set_xlabel('AUC Gap (Train - Test)', fontsize=12, fontweight='bold')
ax4.set_title('Overfitting Analysis (AUC Gap)', fontsize=12, fontweight='bold')
ax4.axvline(x=0.05, color='orange', linestyle='--', linewidth=1, alpha=0.7)
ax4.axvline(x=0.1, color='red', linestyle='--', linewidth=1, alpha=0.7)
ax4.grid(axis='x', alpha=0.3)
# Plot 5: Accuracy Comparison
ax5 = fig.add_subplot(gs[2, 1])
ax5.bar(x_pos - width/2, leaderboard['Train Accuracy'], width, label='Train Acc', alpha=0.8)
ax5.bar(x_pos + width/2, leaderboard['Test Accuracy'], width, label='Test Acc', alpha=0.8)
ax5.set_ylabel('Accuracy', fontsize=12, fontweight='bold')
ax5.set_title('Train vs Test Accuracy', fontsize=12, fontweight='bold')
ax5.set_xticks(x_pos)
ax5.set_xticklabels(leaderboard['Model'], rotation=45, ha='right', fontsize=9)
ax5.legend()
ax5.grid(axis='y', alpha=0.3)
# Final spacing adjustments (prevents overlapping)
fig.subplots_adjust(
top=0.95,
bottom=0.05,
left=0.07,
right=0.98,
hspace=0.6,
wspace=0.5
)
plt.savefig('artifacts/model_comparison_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()
print("Saved: artifacts/model_comparison_dashboard.png")
# ========== 4. SELECT BEST MODEL ==========
print("\n4. MODEL SELECTION")
print("="*50)
best_model_idx = 0 # Already sorted by Test AUC
best_model_name = leaderboard.loc[best_model_idx, 'Model']
best_test_auc = leaderboard.loc[best_model_idx, 'Test AUC']
print(f"SELECTED MODEL: {best_model_name}")
print(f"Test AUC: {best_test_auc:.4f}")
print(f"Test Accuracy: {leaderboard.loc[best_model_idx, 'Test Accuracy']:.4f}")
print(f"Test Sensitivity: {leaderboard.loc[best_model_idx, 'Test Sensitivity']:.4f}")
print(f"Test Specificity: {leaderboard.loc[best_model_idx, 'Test Specificity']:.4f}")
print(f"Overfitting Gap (AUC): {leaderboard.loc[best_model_idx, 'AUC_Gap']:.4f}")
model_mapping = {
'DummyClassifier (Baseline)': tuned_model_dummy,
'Logistic Regression': tuned_model_lr,
'Linear Discriminant Analysis': tuned_model_lda,
'Quadratic Discriminant Analysis': tuned_model_qda,
'Gaussian Naive Bayes': tuned_model_gnb,
'K-Nearest Neighbors': tuned_model_knn,
'Decision Tree': tuned_model_dt,
'Bagging': tuned_model_bagging,
'Random Forest': tuned_model_rf,
'AdaBoost': tuned_model_adaboost,
'Gradient Boosting': tuned_model_gb,
'SVM (Linear)': tuned_model_svm_linear,
'SVM (RBF)': tuned_model_svm_rbf,
'Neural Network (MLP)': tuned_model_mlp
}
best_model = model_mapping[best_model_name]
# ========== 5. SAVE BEST MODEL ==========
print("\n5. SAVING BEST MODEL")
print("="*50)
best_model_path = 'models/best_model.joblib'
dump(best_model, best_model_path)
print(f"Best model saved to: {best_model_path}")
# Save selection metadata
selection_metadata = {
'selected_model': best_model_name,
'selection_criteria': 'Test AUC',
'test_auc': float(best_test_auc),
'test_accuracy': float(leaderboard.loc[best_model_idx, 'Test Accuracy']),
'test_sensitivity': float(leaderboard.loc[best_model_idx, 'Test Sensitivity']),
'test_specificity': float(leaderboard.loc[best_model_idx, 'Test Specificity']),
'overfitting_gap_auc': float(leaderboard.loc[best_model_idx, 'AUC_Gap']),
'rank': 1,
'selection_timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
selection_metadata_path = 'artifacts/model_selection_metadata.json'
with open(selection_metadata_path, 'w') as f:
json.dump(selection_metadata, f, indent=2)
print(f"Selection metadata saved to: {selection_metadata_path}")
# ========== 6. TOP MODELS SUMMARY ==========
print("\n6. TOP 5 MODELS SUMMARY")
print("="*50)
top_5 = leaderboard.head(5)[['Rank', 'Model', 'Test AUC', 'Test Accuracy', 'AUC_Gap']]
print(top_5.to_string(index=False))
# ========== 7. PERFORMANCE INSIGHTS ==========
print("\n7. PERFORMANCE INSIGHTS")
print("="*50)
baseline_auc = leaderboard[leaderboard['Model'] == 'DummyClassifier (Baseline)']['Test AUC'].values[0]
improvement = best_test_auc - baseline_auc
# Note: a percent improvement over a near-chance AUC (~0.5) overstates the gain;
# the absolute AUC gap is the more meaningful figure.
print(f"Best model outperformed the baseline by {improvement:.4f} AUC points ({(improvement/baseline_auc)*100:.1f}% improvement).")
high_overfitting = leaderboard[leaderboard['AUC_Gap'] > 0.1]['Model'].tolist()
if high_overfitting:
print(f"Models with high overfitting (gap > 0.1): {', '.join(high_overfitting)}")
else:
print("No models show high overfitting (all gaps ≤ 0.1).")
best_generalization_idx = leaderboard['AUC_Gap'].idxmin()
best_gen_model = leaderboard.loc[best_generalization_idx, 'Model']
best_gen_gap = leaderboard.loc[best_generalization_idx, 'AUC_Gap']
print(f"Best generalization: {best_gen_model} (AUC gap = {best_gen_gap:.4f})")
print("\n======= MODEL COMPARISON & SELECTION COMPLETE =======")
print(f"Best Model: {best_model_name}")
======= 9. MODEL COMPARISON & SELECTION =======
1. AGGREGATING MODEL METRICS
==================================================
MODEL LEADERBOARD (Ranked by Test AUC)
====================================================================================================
Rank Model Train Accuracy Test Accuracy Train AUC Test AUC Train Sensitivity Test Sensitivity Train Specificity Test Specificity AUC_Gap Acc_Gap
1 Gradient Boosting 0.957311 0.937847 0.987415 0.951394 0.998322 0.992302 0.810792 0.743300 0.036021 0.019464
2 Bagging 0.999807 0.934762 1.000000 0.937623 1.000000 0.992104 0.999118 0.729901 0.062377 0.065045
3 Random Forest 0.981451 0.931678 0.999068 0.936897 0.995755 0.985590 0.930347 0.739069 0.062171 0.049774
4 AdaBoost 0.912502 0.914713 0.928356 0.922013 0.974236 0.975128 0.691941 0.698872 0.006343 -0.002211
5 Neural Network (MLP) 0.933094 0.925046 0.940987 0.914789 0.994373 0.988551 0.714160 0.698166 0.026197 0.008048
6 Decision Tree 0.937876 0.927360 0.930192 0.911335 0.991906 0.984998 0.744842 0.721439 0.018857 0.010516
7 SVM (RBF) 0.887475 0.876311 0.932581 0.908564 0.914269 0.907027 0.791747 0.766573 0.024017 0.011164
8 K-Nearest Neighbors 1.000000 0.885873 1.000000 0.894739 1.000000 0.985393 1.000000 0.530324 0.105261 0.114127
9 Logistic Regression 0.834529 0.830352 0.893121 0.890529 0.843048 0.842479 0.804091 0.787024 0.002592 0.004177
10 SVM (Linear) 0.872281 0.870759 0.892577 0.890186 0.949904 0.947493 0.594957 0.596615 0.002391 0.001523
11 Linear Discriminant Analysis 0.873670 0.872764 0.884986 0.883134 0.944771 0.940979 0.619644 0.629055 0.001852 0.000906
12 Quadratic Discriminant Analysis 0.862795 0.866595 0.882744 0.879489 0.902769 0.907027 0.719979 0.722144 0.003255 -0.003800
13 Gaussian Naive Bayes 0.845712 0.845774 0.852637 0.849821 0.930211 0.930715 0.543819 0.542313 0.002815 -0.000062
14 DummyClassifier (Baseline) 0.655676 0.654226 0.496105 0.492562 0.779725 0.779905 0.212485 0.205219 0.003543 0.001451
2. SAVING LEADERBOARD
==================================================
Leaderboard saved to: artifacts/model_leaderboard.csv
3. VISUALIZING MODEL COMPARISONS
==================================================
Saved: artifacts/model_comparison_dashboard.png
4. MODEL SELECTION
==================================================
SELECTED MODEL: Gradient Boosting
Test AUC: 0.9514
Test Accuracy: 0.9378
Test Sensitivity: 0.9923
Test Specificity: 0.7433
Overfitting Gap (AUC): 0.0360
5. SAVING BEST MODEL
==================================================
Best model saved to: models/best_model.joblib
Selection metadata saved to: artifacts/model_selection_metadata.json
6. TOP 5 MODELS SUMMARY
==================================================
Rank Model Test AUC Test Accuracy AUC_Gap
1 Gradient Boosting 0.951394 0.937847 0.036021
2 Bagging 0.937623 0.934762 0.062377
3 Random Forest 0.936897 0.931678 0.062171
4 AdaBoost 0.922013 0.914713 0.006343
5 Neural Network (MLP) 0.914789 0.925046 0.026197
7. PERFORMANCE INSIGHTS
==================================================
Best model outperformed the baseline by 0.4588 AUC points (93.2% improvement).
Models with high overfitting (gap > 0.1): K-Nearest Neighbors
Best generalization: Linear Discriminant Analysis (AUC gap = 0.0019)
======= MODEL COMPARISON & SELECTION COMPLETE =======
Best Model: Gradient Boosting
10. Evaluation on Hold-Out Test Set¶
The final chosen model was evaluated on the untouched test set to assess its generalization performance. Predictions and predicted probabilities were generated and used to compute a full suite of classification metrics, including accuracy, balanced accuracy, F1 score, ROC AUC, PR AUC, Brier score, sensitivity, and specificity. A confusion matrix was created to visualize correct and incorrect classifications across both loan-status classes.
Multiple diagnostic plots were produced to evaluate different aspects of model behavior: ROC and Precision–Recall curves for ranking performance, a calibration curve assessing probability reliability, a threshold analysis to study metric trade-offs at different decision cutoffs, and a distribution plot of predicted probabilities across true classes. Finally, all test-set metrics and analysis outputs were saved for reproducibility and for downstream reporting.
This evaluation provides a comprehensive view of the model’s performance on unseen data and confirms whether the selected model generalizes effectively for credit-risk prediction.
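The Brier score used below is simply the mean squared error between predicted probabilities and binary outcomes; a minimal check of that equivalence (toy values, not project data):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy labels and predicted default probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.4, 0.8, 0.35])

# Brier score = mean squared error between probabilities and outcomes
manual = np.mean((p_hat - y_true) ** 2)
assert np.isclose(manual, brier_score_loss(y_true, p_hat))
print(manual)  # lower is better; 0 would mean perfectly confident, correct forecasts
```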
# ========== 10. EVALUATION ON HOLD-OUT TEST SET ==========
print("======= 10. EVALUATION ON HOLD-OUT TEST SET =======\n")
# The best model was already fit on the full training data during tuning (Section 7)
# and selected in Section 9, so no refitting is needed here.
print(f"Selected Model: {best_model_name}")
print(f"Model already trained on full training set ({X_train_transformed.shape[0]} samples)")
print(f"Evaluating on hold-out test set ({X_test_transformed.shape[0]} samples)\n")
# ========== 1. PREDICTIONS ON TEST SET ==========
print("1. GENERATING PREDICTIONS")
print("="*50)
y_test_pred_final = best_model.predict(X_test_transformed)
y_test_proba_final = best_model.predict_proba(X_test_transformed)[:, 1]
print(f"Predictions generated for {len(y_test_pred_final)} test samples")
# ========== 2. CONFUSION MATRIX ==========
print("\n2. CONFUSION MATRIX")
print("="*50)
cm_test_final = confusion_matrix(y_test, y_test_pred_final)
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_test_final, annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default (0)', 'Default (1)'],
yticklabels=['No Default (0)', 'Default (1)'],
ax=ax, cbar_kws={'label': 'Count'})
ax.set_xlabel('Predicted Label', fontsize=12, fontweight='bold')
ax.set_ylabel('True Label', fontsize=12, fontweight='bold')
ax.set_title(f'Confusion Matrix — {best_model_name} (Test Set)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('artifacts/confusion_matrix_test.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/confusion_matrix_test.png")
# Print confusion matrix with labels
print("\nConfusion Matrix:")
print(f" Predicted: No Default Predicted: Default")
print(f"Actual: No Default {cm_test_final[0,0]:>6} {cm_test_final[0,1]:>6}")
print(f"Actual: Default {cm_test_final[1,0]:>6} {cm_test_final[1,1]:>6}")
# ========== 3. CLASSIFICATION METRICS ==========
print("\n3. CLASSIFICATION METRICS")
print("="*50)
# Calculate comprehensive metrics
test_accuracy = accuracy_score(y_test, y_test_pred_final)
test_balanced_acc = balanced_accuracy_score(y_test, y_test_pred_final)
test_f1 = f1_score(y_test, y_test_pred_final)
test_auc = roc_auc_score(y_test, y_test_proba_final)
test_avg_precision = average_precision_score(y_test, y_test_proba_final)
test_brier = brier_score_loss(y_test, y_test_proba_final)
# Sensitivity = recall of class 0; Specificity = recall of class 1 (project convention)
test_sensitivity = cm_test_final[0, 0] / (cm_test_final[0, 0] + cm_test_final[0, 1]) if (cm_test_final[0, 0] + cm_test_final[0, 1]) > 0 else 0
test_specificity = cm_test_final[1, 1] / (cm_test_final[1, 0] + cm_test_final[1, 1]) if (cm_test_final[1, 0] + cm_test_final[1, 1]) > 0 else 0
# Create metrics summary
test_metrics = {
'Model': best_model_name,
'Test Set Size': len(y_test),
'Accuracy': test_accuracy,
'Balanced Accuracy': test_balanced_acc,
'F1 Score': test_f1,
'ROC AUC': test_auc,
'Average Precision (PR AUC)': test_avg_precision,
'Brier Score': test_brier,
'Sensitivity (Class 0 Recall)': test_sensitivity,
'Specificity (Class 1 Recall)': test_specificity,
'True Negatives': int(cm_test_final[0, 0]),
'False Positives': int(cm_test_final[0, 1]),
'False Negatives': int(cm_test_final[1, 0]),
'True Positives': int(cm_test_final[1, 1])
}
print("\nTest Set Performance Metrics:")
for metric, value in test_metrics.items():
if isinstance(value, float):
print(f" {metric}: {value:.4f}")
else:
print(f" {metric}: {value}")
# ========== 4. ROC CURVE ==========
print("\n4. ROC CURVE")
print("="*50)
fpr_test_final, tpr_test_final, thresholds_roc = roc_curve(y_test, y_test_proba_final)
plt.figure(figsize=(10, 7))
plt.plot(fpr_test_final, tpr_test_final, linewidth=2, label=f'Test (AUC = {test_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title(f'ROC Curve — {best_model_name} (Test Set)', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/roc_curve_test.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/roc_curve_test.png")
# ========== 5. PRECISION-RECALL CURVE ==========
print("\n5. PRECISION-RECALL CURVE")
print("="*50)
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_test_proba_final)
plt.figure(figsize=(10, 7))
plt.plot(recall, precision, linewidth=2, label=f'Test (AP = {test_avg_precision:.4f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', linewidth=1,
label=f'Baseline (No Skill) = {y_test.mean():.4f}')
plt.xlabel('Recall', fontsize=12, fontweight='bold')
plt.ylabel('Precision', fontsize=12, fontweight='bold')
plt.title(f'Precision-Recall Curve — {best_model_name} (Test Set)', fontsize=14, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/precision_recall_curve_test.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/precision_recall_curve_test.png")
# ========== 6. THRESHOLD ANALYSIS ==========
print("\n6. THRESHOLD ANALYSIS")
print("="*50)
# Sample thresholds for analysis
threshold_candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
threshold_results = []
for thresh in threshold_candidates:
y_pred_thresh = (y_test_proba_final >= thresh).astype(int)
cm_thresh = confusion_matrix(y_test, y_pred_thresh)
acc_thresh = accuracy_score(y_test, y_pred_thresh)
f1_thresh = f1_score(y_test, y_pred_thresh)
sens_thresh = cm_thresh[0, 0] / (cm_thresh[0, 0] + cm_thresh[0, 1]) if (cm_thresh[0, 0] + cm_thresh[0, 1]) > 0 else 0
spec_thresh = cm_thresh[1, 1] / (cm_thresh[1, 0] + cm_thresh[1, 1]) if (cm_thresh[1, 0] + cm_thresh[1, 1]) > 0 else 0
threshold_results.append({
'Threshold': thresh,
'Accuracy': acc_thresh,
'F1 Score': f1_thresh,
'Sensitivity': sens_thresh,
'Specificity': spec_thresh
})
threshold_df = pd.DataFrame(threshold_results)
print("\nThreshold Analysis:")
print(threshold_df.to_string(index=False))
# ========== 7. CALIBRATION CURVE ==========
print("\n7. CALIBRATION ANALYSIS")
print("="*50)
# Compute calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
y_test, y_test_proba_final, n_bins=10, strategy='uniform'
)
plt.figure(figsize=(10, 7))
plt.plot(mean_predicted_value, fraction_of_positives, 's-', linewidth=2,
label=f'{best_model_name} (Brier = {test_brier:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Perfect Calibration')
plt.xlabel('Mean Predicted Probability', fontsize=12, fontweight='bold')
plt.ylabel('Fraction of Positives', fontsize=12, fontweight='bold')
plt.title(f'Calibration Curve — {best_model_name} (Test Set)', fontsize=14, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/calibration_curve_test.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/calibration_curve_test.png")
print(f"\nBrier Score: {test_brier:.4f}")
print(" (Lower is better; 0 = perfect calibration)")
# ========== 8. CLASSIFICATION REPORT ==========
print("\n8. DETAILED CLASSIFICATION REPORT")
print("="*50)
class_report = classification_report(y_test, y_test_pred_final,
target_names=['No Default (0)', 'Default (1)'],
digits=4)
print("\n" + class_report)
# ========== 9. SAVE TEST METRICS ==========
print("\n9. SAVING TEST METRICS")
print("="*50)
# Convert to JSON-serializable format
test_metrics_json = {k: (float(v) if isinstance(v, (np.floating, np.integer)) else v)
for k, v in test_metrics.items()}
metrics_test_path = 'artifacts/metrics_test.json'
with open(metrics_test_path, 'w') as f:
json.dump(test_metrics_json, f, indent=2)
print(f" Test metrics saved to: {metrics_test_path}")
# Save threshold analysis
threshold_df.to_csv('artifacts/threshold_analysis.csv', index=False)
print(" Threshold analysis saved to: artifacts/threshold_analysis.csv")
# ========== 10. COMPREHENSIVE SUMMARY DASHBOARD ==========
print("\n10. CREATING COMPREHENSIVE SUMMARY DASHBOARD")
print("="*50)
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.4, wspace=0.4)
# Plot 1: Confusion Matrix
ax1 = fig.add_subplot(gs[0, 0])
sns.heatmap(cm_test_final, annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default', 'Default'],
yticklabels=['No Default', 'Default'],
ax=ax1, cbar=False)
ax1.set_title('Confusion Matrix', fontweight='bold', fontsize=11)
ax1.set_xlabel('Predicted', fontweight='bold')
ax1.set_ylabel('Actual', fontweight='bold')
# Plot 2: ROC Curve
ax2 = fig.add_subplot(gs[0, 1])
ax2.plot(fpr_test_final, tpr_test_final, linewidth=2)
ax2.plot([0, 1], [0, 1], 'k--', linewidth=1)
ax2.set_xlabel('False Positive Rate', fontweight='bold')
ax2.set_ylabel('True Positive Rate', fontweight='bold')
ax2.set_title(f'ROC (AUC = {test_auc:.4f})', fontweight='bold', fontsize=11)
ax2.grid(alpha=0.3)
# Plot 3: Precision-Recall Curve
ax3 = fig.add_subplot(gs[0, 2])
ax3.plot(recall, precision, linewidth=2)
ax3.axhline(y=y_test.mean(), color='r', linestyle='--', linewidth=1)
ax3.set_xlabel('Recall', fontweight='bold')
ax3.set_ylabel('Precision', fontweight='bold')
ax3.set_title(f'PR Curve (AP = {test_avg_precision:.4f})', fontweight='bold', fontsize=11)
ax3.grid(alpha=0.3)
# Plot 4: Calibration Curve
ax4 = fig.add_subplot(gs[1, 0])
ax4.plot(mean_predicted_value, fraction_of_positives, 's-', linewidth=2)
ax4.plot([0, 1], [0, 1], 'k--', linewidth=1)
ax4.set_xlabel('Mean Predicted Prob', fontweight='bold')
ax4.set_ylabel('Fraction of Positives', fontweight='bold')
ax4.set_title(f'Calibration (Brier = {test_brier:.4f})', fontweight='bold', fontsize=11)
ax4.grid(alpha=0.3)
# Plot 5: Threshold Analysis
ax5 = fig.add_subplot(gs[1, 1])
ax5.plot(threshold_df['Threshold'], threshold_df['Accuracy'], 'o-', label='Accuracy')
ax5.plot(threshold_df['Threshold'], threshold_df['F1 Score'], 's-', label='F1 Score')
ax5.plot(threshold_df['Threshold'], threshold_df['Sensitivity'], '^-', label='Sensitivity')
ax5.plot(threshold_df['Threshold'], threshold_df['Specificity'], 'v-', label='Specificity')
ax5.set_xlabel('Decision Threshold', fontweight='bold')
ax5.set_ylabel('Metric Value', fontweight='bold')
ax5.set_title('Threshold Analysis', fontweight='bold', fontsize=11)
ax5.legend(fontsize=8)
ax5.grid(alpha=0.3)
# Plot 6: Metrics Bar Chart
ax6 = fig.add_subplot(gs[1, 2])
metrics_to_plot = ['Accuracy', 'Balanced Accuracy', 'F1 Score', 'ROC AUC', 'Average Precision (PR AUC)']
values_to_plot = [test_metrics[m] for m in metrics_to_plot]
bars = ax6.barh(metrics_to_plot, values_to_plot, color='steelblue', alpha=0.7)
ax6.set_xlabel('Score', fontweight='bold')
ax6.set_title('Performance Metrics', fontweight='bold', fontsize=11)
ax6.set_xlim(0, 1)
for bar, val in zip(bars, values_to_plot):
ax6.text(val + 0.01, bar.get_y() + bar.get_height()/2, f'{val:.3f}',
va='center', fontsize=9)
# Rotate existing tick labels in place; set_yticklabels() without set_yticks() raises a UserWarning
plt.setp(ax6.get_yticklabels(), rotation=45, ha='right')
ax6.grid(axis='x', alpha=0.3)
# Plot 7: Predicted Probability Distribution
ax7 = fig.add_subplot(gs[2, :])
ax7.hist(y_test_proba_final[y_test == 0], bins=50, alpha=0.6, label='No Default (Actual)', color='green')
ax7.hist(y_test_proba_final[y_test == 1], bins=50, alpha=0.6, label='Default (Actual)', color='red')
ax7.axvline(x=0.5, color='black', linestyle='--', linewidth=1, label='Default Threshold (0.5)')
ax7.set_xlabel('Predicted Probability of Default', fontweight='bold')
ax7.set_ylabel('Frequency', fontweight='bold')
ax7.set_title('Predicted Probability Distribution by True Class', fontweight='bold', fontsize=11)
ax7.legend()
ax7.grid(alpha=0.3)
fig.suptitle(f'Test Set Evaluation Dashboard — {best_model_name}',
fontsize=16, fontweight='bold', y=0.995)
plt.savefig('artifacts/test_evaluation_dashboard.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/test_evaluation_dashboard.png")
print("\n======= TEST SET EVALUATION COMPLETE =======")
print(f"\nFINAL TEST PERFORMANCE ({best_model_name}):")
print(f" • ROC AUC: {test_auc:.4f}")
print(f" • PR AUC (Average Precision): {test_avg_precision:.4f}")
print(f" • Accuracy: {test_accuracy:.4f}")
print(f" • F1 Score: {test_f1:.4f}")
print(f" • Brier Score: {test_brier:.4f}")
print(f" • Sensitivity (TPR for Class 0): {test_sensitivity:.4f}")
print(f" • Specificity (TPR for Class 1): {test_specificity:.4f}")
print("\nAll test evaluation artifacts saved to 'artifacts/' directory.")
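The dict comprehension in step 9 converts only top-level NumPy scalars; nested values would still fail to serialize. A minimal alternative sketch (the `to_builtin` helper name is hypothetical, not part of the project code) passes a `default` hook to `json.dumps` so NumPy types are handled at any nesting depth:

```python
import json
import numpy as np

def to_builtin(obj):
    """Fallback converter for json.dumps: maps NumPy scalars/arrays to built-ins."""
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Nested metrics dict with NumPy scalars, as a model-evaluation step might produce
metrics = {"roc_auc": np.float64(0.9514), "support": {"test": np.int64(6484)}}
serialized = json.dumps(metrics, default=to_builtin, indent=2)
```

The hook is only invoked for objects the encoder cannot handle natively, so plain Python values pass through untouched.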
======= 10. EVALUATION ON HOLD-OUT TEST SET =======
Selected Model: Gradient Boosting
Model already trained on full training set (25932 samples)
Evaluating on hold-out test set (6484 samples)
1. GENERATING PREDICTIONS
==================================================
Predictions generated for 6484 test samples
2. CONFUSION MATRIX
==================================================
Saved: artifacts/confusion_matrix_test.png
Confusion Matrix:
Predicted: No Default Predicted: Default
Actual: No Default 5027 39
Actual: Default 364 1054
3. CLASSIFICATION METRICS
==================================================
Test Set Performance Metrics:
Model: Gradient Boosting
Test Set Size: 6484
Accuracy: 0.9378
Balanced Accuracy: 0.8678
F1 Score: 0.8395
ROC AUC: 0.9514
Average Precision (PR AUC): 0.9118
Brier Score: 0.0499
Sensitivity (Class 0 Recall): 0.9923
Specificity (Class 1 Recall): 0.7433
True Negatives: 5027
False Positives: 39
False Negatives: 364
True Positives: 1054
4. ROC CURVE
==================================================
Saved: artifacts/roc_curve_test.png
5. PRECISION-RECALL CURVE
==================================================
Saved: artifacts/precision_recall_curve_test.png
6. THRESHOLD ANALYSIS
==================================================
Threshold Analysis:
Threshold Accuracy F1 Score Sensitivity Specificity
0.3 0.932603 0.837123 0.971970 0.791961
0.4 0.936922 0.841165 0.985393 0.763752
0.5 0.937847 0.839506 0.992302 0.743300
0.6 0.938310 0.837925 0.996842 0.729196
0.7 0.937847 0.834904 0.999210 0.718618
7. CALIBRATION ANALYSIS
==================================================
Saved: artifacts/calibration_curve_test.png
Brier Score: 0.0499
(Lower is better; 0 = perfect calibration)
8. DETAILED CLASSIFICATION REPORT
==================================================
precision recall f1-score support
No Default (0) 0.9325 0.9923 0.9615 5066
Default (1) 0.9643 0.7433 0.8395 1418
accuracy 0.9378 6484
macro avg 0.9484 0.8678 0.9005 6484
weighted avg 0.9394 0.9378 0.9348 6484
9. SAVING TEST METRICS
==================================================
Test metrics saved to: artifacts/metrics_test.json
Threshold analysis saved to: artifacts/threshold_analysis.csv
10. CREATING COMPREHENSIVE SUMMARY DASHBOARD
==================================================
Saved: artifacts/test_evaluation_dashboard.png
======= TEST SET EVALUATION COMPLETE =======
FINAL TEST PERFORMANCE (Gradient Boosting):
 • ROC AUC: 0.9514
 • PR AUC (Average Precision): 0.9118
 • Accuracy: 0.9378
 • F1 Score: 0.8395
 • Brier Score: 0.0499
 • Sensitivity (TPR for Class 0): 0.9923
 • Specificity (TPR for Class 1): 0.7433
All test evaluation artifacts saved to 'artifacts/' directory.
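The threshold analysis in step 6 samples five coarse cut-offs. If a single operating point must be chosen, every candidate threshold returned by `precision_recall_curve` can be scanned instead, e.g. for the F1-maximizing value. A minimal sketch on synthetic scores (stand-ins for `y_test` and `y_test_proba_final`, not the project's actual arrays):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
# Synthetic labels and correlated probability scores in [0, 1]
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.4 + rng.normal(0.3, 0.2, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall carry one extra trailing point; align them to thresholds
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best_idx = int(np.argmax(f1))
best_threshold = float(thresholds[best_idx])
```

The same scan works for any threshold-dependent metric (e.g. a cost-weighted score) by swapping the expression computed per threshold.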
11. Model Interpretation¶
This section analyzes how the selected model makes predictions by examining feature importance and model behavior. If the final model is tree-based, native impurity-based feature importances are extracted to identify the most influential predictors and to assess how much variance can be explained by the top-ranked features. Cumulative importance plots help determine how many features are needed to capture most of the model’s predictive power.
To provide a model-agnostic perspective, permutation importance is computed on the test set, offering a more reliable measure of each feature’s true impact on predictive performance. The top contributing features are visualized along with their variability across repeated shuffles.
For deeper insight into how individual features influence predictions, partial dependence plots are generated for the strongest numeric predictors, and—when applicable—two-way interaction plots illustrate how pairs of features jointly affect the model output.
All interpretation artifacts, including feature importance rankings, permutation importance results, partial dependence plots, and a summary of key insights, are saved for later reference. This analysis clarifies which features drive the model’s decisions and highlights opportunities for feature reduction and further model refinement.
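The mechanism behind permutation importance is straightforward: shuffle one feature column, re-score the model, and record the drop in the metric. A minimal hand-rolled sketch on synthetic data (the actual analysis in this section uses scikit-learn's `permutation_importance`):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Two informative features and one pure-noise feature
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
baseline_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Permutation importance by hand: shuffle each column in turn and take the AUC drop
drops = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(baseline_auc - roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
```

Because shuffling breaks only the feature-target association (not the marginal distribution), the AUC drop isolates how much the model relies on that column; repeated shuffles, as in `n_repeats=10` below, average out shuffle-to-shuffle noise.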
# ========== 11. MODEL INTERPRETATION ==========
print("======= 11. MODEL INTERPRETATION =======\n")
print(f"Interpreting: {best_model_name}")
print(f"Model type: {type(best_model).__name__}")
# ========== 1. FEATURE IMPORTANCE (TREE-BASED MODELS) ==========
print("\n1. FEATURE IMPORTANCE ANALYSIS")
print("="*50)
# Check if model has feature_importances_ attribute (tree-based)
has_native_importance = hasattr(best_model, 'feature_importances_')
if has_native_importance:
print("\nUsing native feature importances (Gini/impurity-based)...")
# Get feature importances
importances = best_model.feature_importances_
# Get feature names from preprocessor
try:
feature_names = preprocessor.get_feature_names_out()
except Exception:  # older sklearn versions may lack get_feature_names_out
feature_names = [f"feature_{i}" for i in range(len(importances))]
# Create importance dataframe
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importances
}).sort_values('importance', ascending=False)
# Display top 20 features
print("\nTop 20 Most Important Features:")
print(importance_df.head(20).to_string(index=False))
# Plot feature importances
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
# Top 20 features
top_20 = importance_df.head(20)
axes[0].barh(range(len(top_20)), top_20['importance'].values, color='steelblue', alpha=0.7)
axes[0].set_yticks(range(len(top_20)))
axes[0].set_yticklabels([name.split('__')[1] if '__' in name else name
for name in top_20['feature']], fontsize=9)
axes[0].set_xlabel('Importance (Gini/Impurity)', fontsize=12, fontweight='bold')
axes[0].set_title('Top 20 Feature Importances', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)
# Cumulative importance
importance_df_sorted = importance_df.sort_values('importance', ascending=False).reset_index(drop=True)
cumulative_importance = np.cumsum(importance_df_sorted['importance'].values)
axes[1].plot(range(len(cumulative_importance)), cumulative_importance, linewidth=2, color='darkgreen')
axes[1].axhline(y=0.8, color='red', linestyle='--', linewidth=1, label='80% threshold')
axes[1].axhline(y=0.9, color='orange', linestyle='--', linewidth=1, label='90% threshold')
axes[1].set_xlabel('Number of Features', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Cumulative Importance', fontsize=12, fontweight='bold')
axes[1].set_title('Cumulative Feature Importance', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/feature_importance_native.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/feature_importance_native.png")
# Save importance data
importance_df.to_csv('artifacts/feature_importance.csv', index=False)
print(" Saved: artifacts/feature_importance.csv")
# Calculate how many features needed for 80% and 90% importance
n_80 = np.argmax(cumulative_importance >= 0.8) + 1
n_90 = np.argmax(cumulative_importance >= 0.9) + 1
print(f"\nFeatures needed for 80% importance: {n_80}/{len(importances)}")
print(f"Features needed for 90% importance: {n_90}/{len(importances)}")
else:
print("Model does not have native feature importances (not tree-based)")
importance_df = None
# ========== 2. PERMUTATION IMPORTANCE (MODEL-AGNOSTIC) ==========
print("\n2. PERMUTATION IMPORTANCE ANALYSIS")
print("="*50)
print("\nComputing permutation importance on test set (this may take a moment)...")
perm_importance = permutation_importance(
best_model,
X_test_transformed,
y_test,
n_repeats=10,
random_state=RANDOM_STATE,
scoring='roc_auc',
n_jobs=-1
)
# Get feature names
try:
feature_names_perm = preprocessor.get_feature_names_out()
except Exception:  # older sklearn versions may lack get_feature_names_out
feature_names_perm = [f"feature_{i}" for i in range(X_test_transformed.shape[1])]
# Create permutation importance dataframe
perm_importance_df = pd.DataFrame({
'feature': feature_names_perm,
'importance_mean': perm_importance.importances_mean,
'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print("\nTop 20 Features (Permutation Importance):")
print(perm_importance_df.head(20).to_string(index=False))
# Plot permutation importance
fig, ax = plt.subplots(figsize=(12, 10))
top_20_perm = perm_importance_df.head(20)
y_pos = range(len(top_20_perm))
ax.barh(y_pos, top_20_perm['importance_mean'].values,
xerr=top_20_perm['importance_std'].values,
color='coral', alpha=0.7, edgecolor='black', linewidth=1)
ax.set_yticks(y_pos)
ax.set_yticklabels([name.split('__')[1] if '__' in name else name
for name in top_20_perm['feature']], fontsize=9)
ax.set_xlabel('Decrease in ROC AUC', fontsize=12, fontweight='bold')
ax.set_title('Top 20 Features - Permutation Importance (Test Set)', fontsize=14, fontweight='bold')
ax.invert_yaxis()
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('artifacts/permutation_importance.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/permutation_importance.png")
# Save permutation importance
perm_importance_df.to_csv('artifacts/permutation_importance.csv', index=False)
print(" Saved: artifacts/permutation_importance.csv")
# ========== 3. PARTIAL DEPENDENCE PLOTS ==========
print("\n3. PARTIAL DEPENDENCE PLOTS")
print("="*50)
# Identify top numeric features for PDP
# Get numeric feature indices
numeric_feature_indices = []
for idx, name in enumerate(feature_names_perm):
if 'num__' in name:
numeric_feature_indices.append(idx)
# Select top 6 numeric features by permutation importance
top_numeric_features = []
for idx, row in perm_importance_df.iterrows():
feat_idx = list(feature_names_perm).index(row['feature'])
if feat_idx in numeric_feature_indices:
top_numeric_features.append(feat_idx)
if len(top_numeric_features) >= 6:
break
if len(top_numeric_features) > 0:
print(f"\nGenerating Partial Dependence Plots for top {len(top_numeric_features)} numeric features...")
# Create PDP display
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle(f'Partial Dependence Plots — {best_model_name}',
fontsize=16, fontweight='bold', y=0.995)
for idx, feat_idx in enumerate(top_numeric_features):
ax = axes[idx // 3, idx % 3]
# Create PDP for single feature
display = PartialDependenceDisplay.from_estimator(
best_model,
X_test_transformed,
features=[feat_idx],
ax=ax,
kind='average',
grid_resolution=50
)
# Update title to show feature name
feat_name = feature_names_perm[feat_idx]
clean_name = feat_name.split('__')[1] if '__' in feat_name else feat_name
ax.set_title(f'PDP: {clean_name}', fontsize=11, fontweight='bold')
ax.set_xlabel('Feature Value', fontsize=10)
ax.set_ylabel('Partial Dependence', fontsize=10)
# Remove empty subplots if fewer than 6 features
for idx in range(len(top_numeric_features), 6):
fig.delaxes(axes[idx // 3, idx % 3])
plt.tight_layout()
plt.savefig('artifacts/partial_dependence_plots.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/partial_dependence_plots.png")
else:
print("No numeric features found for Partial Dependence Plots")
# ========== 4. TWO-WAY INTERACTIONS (IF TREE-BASED) ==========
if has_native_importance and len(top_numeric_features) >= 2:
print("\n4. TWO-WAY INTERACTION PLOTS")
print("="*50)
print("\nGenerating 2D Partial Dependence for top 2 feature pairs...")
# Select top 2 pairs
top_2_features = top_numeric_features[:2]
fig, ax = plt.subplots(figsize=(10, 8))
# Create 2D PDP
display = PartialDependenceDisplay.from_estimator(
best_model,
X_test_transformed,
features=[(top_2_features[0], top_2_features[1])],
ax=ax,
kind='average',
grid_resolution=30
)
# Update title
feat1_name = feature_names_perm[top_2_features[0]].split('__')[1] if '__' in feature_names_perm[top_2_features[0]] else feature_names_perm[top_2_features[0]]
feat2_name = feature_names_perm[top_2_features[1]].split('__')[1] if '__' in feature_names_perm[top_2_features[1]] else feature_names_perm[top_2_features[1]]
ax.set_title(f'2D Partial Dependence: {feat1_name} vs {feat2_name}',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('artifacts/partial_dependence_2d.png', dpi=300, bbox_inches='tight')
plt.show()
print(" Saved: artifacts/partial_dependence_2d.png")
# ========== 5. INTERPRETATION SUMMARY ==========
print("\n5. INTERPRETATION SUMMARY")
print("="*50)
summary = {
'model': best_model_name,
'has_native_importance': has_native_importance,
'n_features': X_train_transformed.shape[1],
'interpretation_methods': ['permutation_importance']
}
if has_native_importance and importance_df is not None:
summary['interpretation_methods'].append('native_feature_importance')
summary['top_5_features_native'] = importance_df.head(5)['feature'].tolist()
summary['n_features_for_80pct_importance'] = int(n_80)
summary['n_features_for_90pct_importance'] = int(n_90)
summary['top_5_features_permutation'] = perm_importance_df.head(5)['feature'].tolist()
if len(top_numeric_features) > 0:
summary['interpretation_methods'].append('partial_dependence')
summary['n_pdp_plots'] = len(top_numeric_features)
# Save summary
summary_path = 'artifacts/interpretation_summary.json'
with open(summary_path, 'w') as f:
# Convert non-serializable types
summary_clean = {k: (v if not isinstance(v, (np.integer, np.floating)) else float(v))
for k, v in summary.items()}
json.dump(summary_clean, f, indent=2)
print(f"\n Interpretation summary saved to: {summary_path}")
# ========== 6. KEY INSIGHTS ==========
print("\n6. KEY INSIGHTS")
print("="*50)
print("\n INTERPRETATION INSIGHTS:")
if has_native_importance and importance_df is not None:
print(f"\n • Top feature (native): {importance_df.iloc[0]['feature'].split('__')[1] if '__' in importance_df.iloc[0]['feature'] else importance_df.iloc[0]['feature']}")
print(f" - Importance: {importance_df.iloc[0]['importance']:.4f}")
print(f"\n • Top feature (permutation): {perm_importance_df.iloc[0]['feature'].split('__')[1] if '__' in perm_importance_df.iloc[0]['feature'] else perm_importance_df.iloc[0]['feature']}")
print(f" - Impact on AUC: {perm_importance_df.iloc[0]['importance_mean']:.4f} ± {perm_importance_df.iloc[0]['importance_std']:.4f}")
if has_native_importance:
print(f"\n • Feature efficiency:")
print(f" - {n_80} features explain 80% of importance")
print(f" - {n_90} features explain 90% of importance")
print(f" - Potential for feature reduction from {len(importances)} to ~{n_90} features")
# Identify engineered vs original features in top 10
if importance_df is not None:
top_10_features = importance_df.head(10)['feature'].tolist()
engineered_in_top_10 = []
original_in_top_10 = []
for feat in top_10_features:
clean_name = feat.split('__')[1] if '__' in feat else feat
# Check if engineered (rough heuristic based on naming)
if any(eng_feat in clean_name for eng_feat in ['dti', 'total_loan', 'income_to_loan',
'age_to_cred', 'stability', 'log_',
'bucket', 'quartile', 'risk_profile']):
engineered_in_top_10.append(clean_name)
else:
original_in_top_10.append(clean_name)
if len(engineered_in_top_10) > 0:
print(f"\n • Feature engineering impact:")
print(f" - {len(engineered_in_top_10)}/10 top features are engineered")
print(f" - Engineered features: {', '.join(engineered_in_top_10[:5])}")
print("\n======= MODEL INTERPRETATION COMPLETE =======")
print("\nAll interpretation artifacts saved to 'artifacts/' directory:")
print(" • Feature importance CSV and plots")
print(" • Permutation importance CSV and plots")
print(" • Partial dependence plots")
print(" • Interpretation summary JSON")
======= 11. MODEL INTERPRETATION =======
Interpreting: Gradient Boosting
Model type: GradientBoostingClassifier
1. FEATURE IMPORTANCE ANALYSIS
==================================================
Using native feature importances (Gini/impurity-based)...
Top 20 Most Important Features:
feature importance
num__income_to_loan 0.164002
num__loan_int_rate 0.153556
cat__person_home_ownership_RENT 0.136506
num__dti_ratio 0.121010
num__person_income 0.070013
num__log_income 0.056462
cat__loan_grade_D 0.048836
num__employment_stability 0.039599
cat__loan_grade_C 0.025396
cat__loan_intent_DEBTCONSOLIDATION 0.025253
cat__loan_intent_MEDICAL 0.024075
cat__person_home_ownership_OWN 0.018060
num__total_loan_cost 0.014182
cat__loan_intent_HOMEIMPROVEMENT 0.013872
num__person_age 0.013505
num__loan_percent_income 0.012613
cat__loan_grade_E 0.008161
num__risk_profile 0.007349
num__age_to_cred_hist 0.006516
cat__loan_intent_VENTURE 0.004183
Saved: artifacts/feature_importance_native.png
Saved: artifacts/feature_importance.csv
Features needed for 80% importance: 9/46
Features needed for 90% importance: 14/46
2. PERMUTATION IMPORTANCE ANALYSIS
==================================================
Computing permutation importance on test set (this may take a moment)...
Top 20 Features (Permutation Importance):
feature importance_mean importance_std
num__dti_ratio 0.143085 0.003020
num__loan_percent_income 0.096880 0.002647
num__income_to_loan 0.081914 0.001749
num__log_income 0.035465 0.002359
num__person_income 0.033506 0.002041
cat__person_home_ownership_OWN 0.020129 0.003174
cat__loan_grade_D 0.017233 0.001568
cat__person_home_ownership_RENT 0.014485 0.000895
cat__loan_intent_HOMEIMPROVEMENT 0.012324 0.001036
num__loan_int_rate 0.012298 0.001361
cat__loan_intent_VENTURE 0.008380 0.001615
num__person_age 0.007180 0.000609
num__total_loan_cost 0.006167 0.001298
cat__loan_intent_DEBTCONSOLIDATION 0.005603 0.000506
cat__loan_grade_E 0.005598 0.000654
num__employment_stability 0.003941 0.000604
cat__loan_intent_MEDICAL 0.003928 0.000587
num__log_loan_amnt 0.003633 0.000688
num__loan_amnt 0.003169 0.000785
cat__loan_grade_C 0.001684 0.000506
Saved: artifacts/permutation_importance.png
Saved: artifacts/permutation_importance.csv
3. PARTIAL DEPENDENCE PLOTS
==================================================
Generating Partial Dependence Plots for top 6 numeric features...
Saved: artifacts/partial_dependence_plots.png
4. TWO-WAY INTERACTION PLOTS
==================================================
Generating 2D Partial Dependence for top 2 feature pairs...
Saved: artifacts/partial_dependence_2d.png
5. INTERPRETATION SUMMARY
==================================================
Interpretation summary saved to: artifacts/interpretation_summary.json
6. KEY INSIGHTS
==================================================
INTERPRETATION INSIGHTS:
• Top feature (native): income_to_loan
- Importance: 0.1640
• Top feature (permutation): dti_ratio
- Impact on AUC: 0.1431 ± 0.0030
• Feature efficiency:
- 9 features explain 80% of importance
- 14 features explain 90% of importance
- Potential for feature reduction from 46 to ~14 features
• Feature engineering impact:
- 4/10 top features are engineered
- Engineered features: income_to_loan, dti_ratio, log_income, employment_stability
======= MODEL INTERPRETATION COMPLETE =======
All interpretation artifacts saved to 'artifacts/' directory:
• Feature importance CSV and plots
• Permutation importance CSV and plots
• Partial dependence plots
• Interpretation summary JSON
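The feature-efficiency insight above (roughly 14 of 46 features carrying 90% of the importance) can be sanity-checked by refitting on only the top-k columns and comparing test AUC. A sketch on synthetic data, not the project's actual matrices:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# 20 features, of which only the first 5 drive the label
X = rng.normal(size=(1000, 20))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=1000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)

full = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
full_auc = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])

# Keep the k columns with the largest impurity importance and refit
k = 5
top_k = np.argsort(full.feature_importances_)[::-1][:k]
reduced = GradientBoostingClassifier(random_state=1).fit(X_tr[:, top_k], y_tr)
reduced_auc = roc_auc_score(y_te, reduced.predict_proba(X_te[:, top_k])[:, 1])
```

If the reduced model's AUC is within noise of the full model's, the feature-reduction claim holds; for this project the same comparison would use the preprocessed train/test matrices and the top-14 features.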
12. Reproducibility & Artifacts¶
This section consolidates and verifies all project outputs to ensure full reproducibility. All expected artifacts—including trained models, preprocessing objects, evaluation metrics, data dictionaries, feature-importance files, and visualization assets—are checked for completeness. A comprehensive project summary is generated, capturing dataset details, feature counts, training and test sizes, selected model performance, and the number of artifacts produced across each category.
Environment information (Python version and library versions) is saved to support exact reproducibility of results. The notebook is exported to HTML for report generation, and a final deliverables checklist confirms that all required components of Milestone 2 have been completed. A final summary report compiles key results, performance metrics, and model descriptions, and all artifacts are organized into structured folders for submission. This ensures that the entire analysis is transparent, traceable, and fully reproducible.
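The code below records versions via each library's `__version__` attribute; `importlib.metadata` offers a uniform alternative that works for any installed distribution (the package list here is assumed, and missing packages are tolerated):

```python
import json
import sys
from importlib.metadata import version, PackageNotFoundError

packages = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn"]
env = {"python_version": sys.version.split()[0]}
for pkg in packages:
    try:
        # Distribution name (e.g. "scikit-learn"), not import name ("sklearn")
        env[pkg] = version(pkg)
    except PackageNotFoundError:
        env[pkg] = "not installed"

snapshot = json.dumps(env, indent=2)
```

This avoids `__import__` calls and works even for packages that do not expose a `__version__` attribute.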
# ========== 12. REPRODUCIBILITY & ARTIFACTS ==========
print("======= 12. REPRODUCIBILITY & ARTIFACTS =======\n")
# ========== 1. VERIFY ALL ARTIFACTS EXIST ==========
print("1. VERIFYING ARTIFACTS")
print("="*50)
# Define expected artifacts
expected_artifacts = {
'models': [
'preprocessor.joblib',
'preprocessor_engineered.joblib',
'best_model.joblib',
'dummy_baseline.joblib',
'logistic_regression_best.joblib',
'lda_best.joblib',
'qda_best.joblib',
'naive_bayes_best.joblib',
'knn_best.joblib',
'decision_tree_best.joblib',
'bagging_best.joblib',
'random_forest_best.joblib',
'adaboost_best.joblib',
'gradient_boosting_best.joblib',
'svm_linear_best.joblib',
'svm_rbf_best.joblib',
'mlp_best.joblib'
],
'artifacts': [
'data_dictionary_pre_preprocessing.csv',
'data_dictionary_post_preprocessing.csv',
'feature_provenance.json',
'model_leaderboard.csv',
'model_selection_metadata.json',
'metrics_test.json',
'threshold_analysis.csv',
'feature_importance.csv',
'permutation_importance.csv',
'interpretation_summary.json'
],
'Output': [
'numeric_distributions.png',
'categorical_distributions.png',
'numeric_vs_target_boxplots.png',
'categorical_vs_target_stacked.png',
'pearson_correlation_matrix.png',
'spearman_correlation_matrix.png',
'roc_curve_dummy_baseline.png',
'roc_curve_logistic_regression.png',
'roc_curve_lda.png',
'roc_curve_qda.png',
'roc_curve_naive_bayes.png',
'roc_curve_knn.png',
'roc_curve_decision_tree.png',
'roc_curve_bagging.png',
'roc_curve_random_forest.png',
'roc_curve_adaboost.png',
'roc_curve_gradient_boosting.png',
'roc_curve_svm_linear.png',
'roc_curve_svm_rbf.png',
'roc_curve_mlp.png'
]
}
# Check artifacts
missing_artifacts = []
for folder, files in expected_artifacts.items():
for file in files:
filepath = f"{folder}/{file}"
if not os.path.exists(filepath):
missing_artifacts.append(filepath)
if missing_artifacts:
print("\n Missing artifacts:")
for artifact in missing_artifacts:
print(f" - {artifact}")
else:
print("\n All expected artifacts present")
# ========== 2. CREATE COMPREHENSIVE ARTIFACT SUMMARY ==========
print("\n2. CREATING ARTIFACT SUMMARY")
print("="*50)
artifact_summary = {
'project': 'Team 6 - Milestone 2',
'dataset': 'credit_risk_dataset.csv',
'target': 'loan_status',
'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
'random_seed': RANDOM_STATE,
'test_size': TEST_SIZE,
'n_samples': {
'total': len(df_clean),
'train': X_train_transformed.shape[0],
'test': X_test_transformed.shape[0]
},
'n_features': {
'original': len(feature_provenance['original']),
'engineered': len(feature_provenance['engineered']),
'post_preprocessing': X_train_transformed.shape[1]
},
'best_model': {
'name': best_model_name,
'test_auc': float(test_auc),
'test_accuracy': float(test_accuracy),
'test_f1': float(test_f1)
},
'artifacts': {
'models': len(expected_artifacts['models']),
'data_files': len(expected_artifacts['artifacts']),
'visualizations': len(expected_artifacts['Output'])
}
}
summary_path = 'artifacts/project_summary.json'
with open(summary_path, 'w') as f:
json.dump(artifact_summary, f, indent=2)
print(f" Project summary saved to: {summary_path}")
# ========== 3. SAVE ENVIRONMENT INFORMATION ==========
print("\n3. SAVING ENVIRONMENT INFORMATION")
print("="*50)
environment_info = {
'python_version': sys.version,
'numpy_version': np.__version__,
'pandas_version': pd.__version__,
'sklearn_version': __import__('sklearn').__version__,
'matplotlib_version': __import__('matplotlib').__version__,
'seaborn_version': sns.__version__,
'random_seed': RANDOM_STATE,
'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}
env_path = 'artifacts/environment_info.json'
with open(env_path, 'w') as f:
json.dump(environment_info, f, indent=2)
print(f" Environment info saved to: {env_path}")
# ========== 4. EXPORT NOTEBOOK TO HTML (for PDF conversion) ==========
print("\n4. EXPORTING NOTEBOOK TO HTML")
print("="*50)
try:
# Get the notebook name from the current working directory
notebook_name = 'Team6_Milestone2.ipynb' # Adjust if needed
# Use nbconvert to export to HTML; omit --execute so the notebook is not
# re-run from inside itself
os.system(f'jupyter nbconvert --to html "{notebook_name}" --output-dir artifacts --output Team6_Milestone2_Report.html')
print(f" Notebook exported to HTML: artifacts/Team6_Milestone2_Report.html")
print("\n To convert to PDF:")
print(" Option 1: Use your browser to open the HTML and 'Print to PDF'")
print(" Option 2: Install wkhtmltopdf and run:")
print(" wkhtmltopdf artifacts/Team6_Milestone2_Report.html artifacts/Team6_Milestone2_Report.pdf")
print(" Option 3: Use nbconvert with LaTeX (requires LaTeX installation):")
print(f" jupyter nbconvert --to pdf \"{notebook_name}\"")
except Exception as e:
print(f" Could not auto-export notebook: {e}")
print("\n Manual export instructions:")
print(" 1. File → Download as → HTML (.html)")
print(" 2. Open HTML in browser and Print to PDF")
print(" OR")
print(" 3. File → Download as → PDF via LaTeX (.pdf)")
# ========== 5. CREATE FINAL DELIVERABLES CHECKLIST ==========
print("\n5. FINAL DELIVERABLES CHECKLIST")
print("="*50)
checklist = {
'EDA with plots': True,
'Full preprocessing pipeline': True,
'Feature engineering': True,
'Final dataset characteristics': True,
'Multiple candidate algorithms (14 models)': True,
'Cross-validation protocol': True,
'Hyperparameter tuning': True,
'Model comparison artifact': True,
'Selected model identified': True,
'Performance metrics (CM, ROC/AUC, PR, accuracy/F1)': True,
'Model interpretation (feature importances, PDP)': True,
'Reproducibility artifacts': True,
'Code quality & documentation': True
}
print("\nDELIVERABLES STATUS:")
for item, status in checklist.items():
    if status:  # only list items that are actually complete
        print(f" {item}")
# ========== 6. GENERATE FINAL SUMMARY REPORT ==========
print("\n6. GENERATING FINAL SUMMARY REPORT")
print("="*50)
summary_report = f"""
{'='*80}
TEAM 6 - MILESTONE 2 - FINAL SUMMARY REPORT
{'='*80}
PROJECT INFORMATION
-------------------
Dataset: credit_risk_dataset.csv
Target: loan_status (binary classification: 0=No Default, 1=Default)
Team Members: John Holik, Claiton Pinto, Marina Bunyatova
Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
DATA SUMMARY
------------
Total Samples: {len(df_clean):,}
- Training Set: {X_train_transformed.shape[0]:,} ({(1-TEST_SIZE)*100:.0f}%)
- Test Set: {X_test_transformed.shape[0]:,} ({TEST_SIZE*100:.0f}%)
Class Distribution (Test Set):
- Class 0 (No Default): {test_class_counts[0]:,} ({test_class_pcts[0]:.2%})
- Class 1 (Default): {test_class_counts[1]:,} ({test_class_pcts[1]:.2%})
FEATURE ENGINEERING
-------------------
Original Features: {len(feature_provenance['original'])}
Engineered Features: {len(feature_provenance['engineered'])}
Total Features (after preprocessing): {X_train_transformed.shape[1]}
MODELS EVALUATED
----------------
{len(all_metrics)} models trained and evaluated:
{chr(10).join([f" {i+1}. {m['Model']}" for i, m in enumerate(all_metrics)])}
BEST MODEL SELECTION
--------------------
Selected Model: {best_model_name}
Selection Criterion: Test AUC (ROC-AUC Score)
FINAL TEST PERFORMANCE
----------------------
ROC AUC: {test_auc:.4f}
PR AUC (Average Precision): {test_avg_precision:.4f}
Accuracy: {test_accuracy:.4f}
Balanced Accuracy: {test_balanced_acc:.4f}
F1 Score: {test_f1:.4f}
Brier Score: {test_brier:.4f}
Recall, Class 0 (No Default): {test_sensitivity:.4f}
Recall, Class 1 (Default): {test_specificity:.4f}
Confusion Matrix:
Predicted: No Default Predicted: Default
Actual: No Default {cm_test_final[0,0]:>6} {cm_test_final[0,1]:>6}
Actual: Default {cm_test_final[1,0]:>6} {cm_test_final[1,1]:>6}
ARTIFACTS GENERATED
-------------------
Models: {len(expected_artifacts['models'])} saved models
Data Files: {len(expected_artifacts['artifacts'])} CSV/JSON files
Visualizations: {len(expected_artifacts['Output'])} plots
All artifacts saved to:
- models/ (trained models and preprocessors)
- artifacts/ (metrics, summaries, analysis outputs)
- Output/ (EDA and model evaluation plots)
REPRODUCIBILITY
---------------
Random Seed: {RANDOM_STATE}
Python Version: {sys.version.split()[0]}
Key Libraries:
- NumPy: {np.__version__}
- Pandas: {pd.__version__}
- Scikit-learn: {__import__('sklearn').__version__}
- Matplotlib: {__import__('matplotlib').__version__}
- Seaborn: {sns.__version__}
{'='*80}
END OF REPORT
{'='*80}
"""
# Save report
report_path = 'artifacts/FINAL_SUMMARY_REPORT.txt'
with open(report_path, 'w') as f:
f.write(summary_report)
print(summary_report)
print(f"\n Final summary report saved to: {report_path}")
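Beyond saving the report, a checksum manifest would let a grader verify that artifacts were not altered after generation. A hedged sketch — the `sha256_of` helper and `manifest_demo.txt` are illustrative; the real loop would walk models/, artifacts/ and Output/:

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small file with known contents.
with open('manifest_demo.txt', 'w') as f:
    f.write('hello\n')
print(sha256_of('manifest_demo.txt'))
# 5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
```

Writing one `path,digest` line per artifact to something like `artifacts/MANIFEST.txt` would make the check a one-liner at submission time.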
# ========== 7. LIST ALL DELIVERABLES ==========
print("\n7. COMPLETE ARTIFACT LISTING")
print("="*50)
print("\nDIRECTORY STRUCTURE:")
for folder in ['models', 'artifacts', 'Output']:
if os.path.exists(folder):
files = os.listdir(folder)
print(f"\n{folder}/ ({len(files)} files)")
for file in sorted(files)[:10]: # Show first 10
filepath = os.path.join(folder, file)
size = os.path.getsize(filepath)
size_str = f"{size/1024:.1f} KB" if size > 1024 else f"{size} B"
print(f" - {file} ({size_str})")
if len(files) > 10:
print(f" ... and {len(files)-10} more files")
print("\n" + "="*80)
print("======= REPRODUCIBILITY & ARTIFACTS COMPLETE =======")
print("="*80)
print("\n ALL DELIVERABLES READY")
print("\n NEXT STEPS:")
print(" 1. Review artifacts/FINAL_SUMMARY_REPORT.txt")
print(" 2. Export notebook to PDF using one of the methods above")
print(" 3. Package all files (notebook + models/ + artifacts/ + Output/) for submission")
print("\n PROJECT COMPLETE!")
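The packaging step in the next-steps list can itself be scripted. A sketch using `shutil.make_archive` on a stand-in directory — in the real run, `root_dir` would point at the project folder containing the notebook plus models/, artifacts/ and Output/:

```python
import os
import shutil
import zipfile

# Stand-in deliverables tree; the real call would target the project root.
os.makedirs('submission_demo/artifacts', exist_ok=True)
with open('submission_demo/artifacts/FINAL_SUMMARY_REPORT.txt', 'w') as f:
    f.write('demo report\n')

# Zip everything under submission_demo/ into submission_demo.zip.
archive_path = shutil.make_archive('submission_demo', 'zip',
                                   root_dir='submission_demo')

with zipfile.ZipFile(archive_path) as zf:
    print(sorted(zf.namelist()))
```

`make_archive` returns the path of the archive it wrote, so the same call can feed a final "submission package created" print.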
======= 12. REPRODUCIBILITY & ARTIFACTS =======
1. VERIFYING ARTIFACTS
==================================================
All expected artifacts present
2. CREATING ARTIFACT SUMMARY
==================================================
Project summary saved to: artifacts/project_summary.json
3. SAVING ENVIRONMENT INFORMATION
==================================================
Environment info saved to: artifacts/environment_info.json
4. EXPORTING NOTEBOOK TO HTML
==================================================
Notebook exported to HTML: artifacts/Team6_Milestone2_Report.html
To convert to PDF:
Option 1: Use your browser to open the HTML and 'Print to PDF'
Option 2: Install wkhtmltopdf and run:
wkhtmltopdf artifacts/Team6_Milestone2_Report.html artifacts/Team6_Milestone2_Report.pdf
Option 3: Use nbconvert with LaTeX (requires LaTeX installation):
jupyter nbconvert --to pdf "Team6_Milestone2.ipynb"
5. FINAL DELIVERABLES CHECKLIST
==================================================
DELIVERABLES STATUS:
EDA with plots
Full preprocessing pipeline
Feature engineering
Final dataset characteristics
Multiple candidate algorithms (14 models)
Cross-validation protocol
Hyperparameter tuning
Model comparison artifact
Selected model identified
Performance metrics (CM, ROC/AUC, PR, accuracy/F1)
Model interpretation (feature importances, PDP)
Reproducibility artifacts
Code quality & documentation
6. GENERATING FINAL SUMMARY REPORT
==================================================
================================================================================
TEAM 6 - MILESTONE 2 - FINAL SUMMARY REPORT
================================================================================
PROJECT INFORMATION
-------------------
Dataset: credit_risk_dataset.csv
Target: loan_status (binary classification: 0=No Default, 1=Default)
Team Members: John Holik, Claiton Pinto, Marina Bunyatova
Timestamp: 2025-11-20 22:02:12
DATA SUMMARY
------------
Total Samples: 32,416
- Training Set: 25,932 (80%)
- Test Set: 6,484 (20%)
Class Distribution (Test Set):
- Class 0 (No Default): 5,066 (78.13%)
- Class 1 (Default): 1,418 (21.87%)
FEATURE ENGINEERING
-------------------
Original Features: 11
Engineered Features: 11
Total Features (after preprocessing): 46
MODELS EVALUATED
----------------
14 models trained and evaluated:
1. DummyClassifier (Baseline)
2. Logistic Regression
3. Linear Discriminant Analysis
4. Quadratic Discriminant Analysis
5. Gaussian Naive Bayes
6. K-Nearest Neighbors
7. Decision Tree
8. Bagging
9. Random Forest
10. AdaBoost
11. Gradient Boosting
12. SVM (Linear)
13. SVM (RBF)
14. Neural Network (MLP)
BEST MODEL SELECTION
--------------------
Selected Model: Gradient Boosting
Selection Criterion: Test AUC (ROC-AUC Score)
FINAL TEST PERFORMANCE
----------------------
ROC AUC: 0.9514
PR AUC (Average Precision): 0.9118
Accuracy: 0.9378
Balanced Accuracy: 0.8678
F1 Score: 0.8395
Brier Score: 0.0499
Recall, Class 0 (No Default): 0.9923
Recall, Class 1 (Default): 0.7433
Confusion Matrix:
Predicted: No Default Predicted: Default
Actual: No Default 5027 39
Actual: Default 364 1054
ARTIFACTS GENERATED
-------------------
Models: 17 saved models
Data Files: 10 CSV/JSON files
Visualizations: 20 plots
All artifacts saved to:
- models/ (trained models and preprocessors)
- artifacts/ (metrics, summaries, analysis outputs)
- Output/ (EDA and model evaluation plots)
REPRODUCIBILITY
---------------
Random Seed: 42
Python Version: 3.10.19
Key Libraries:
- NumPy: 2.2.5
- Pandas: 2.3.3
- Scikit-learn: 1.7.1
- Matplotlib: 3.10.6
- Seaborn: 0.13.2
================================================================================
END OF REPORT
================================================================================
Final summary report saved to: artifacts/FINAL_SUMMARY_REPORT.txt
7. COMPLETE ARTIFACT LISTING
==================================================
DIRECTORY STRUCTURE:
models/ (17 files)
- adaboost_best.joblib (310.5 KB)
- bagging_best.joblib (62597.2 KB)
- best_model.joblib (1124.5 KB)
- decision_tree_best.joblib (20.5 KB)
- dummy_baseline.joblib (583 B)
- gradient_boosting_best.joblib (1124.5 KB)
- knn_best.joblib (9522.7 KB)
- lda_best.joblib (18.5 KB)
- logistic_regression_best.joblib (1.2 KB)
- mlp_best.joblib (238.1 KB)
... and 7 more files
artifacts/ (56 files)
- FINAL_SUMMARY_REPORT.txt (2.4 KB)
- adaboost_best_params.json (97 B)
- adaboost_metrics.json (353 B)
- bagging_best_params.json (270 B)
- bagging_metrics.json (321 B)
- calibration_curve_test.png (207.8 KB)
- confusion_matrix_test.png (112.6 KB)
- data_dictionary_post_preprocessing.csv (5.3 KB)
- data_dictionary_pre_preprocessing.csv (697 B)
- decision_tree_best_params.json (143 B)
... and 46 more files
Output/ (20 files)
- categorical_distributions.png (321.0 KB)
- categorical_vs_target_stacked.png (340.0 KB)
- numeric_distributions.png (296.0 KB)
- numeric_vs_target_boxplots.png (445.9 KB)
- pearson_correlation_matrix.png (313.7 KB)
- roc_curve_adaboost.png (195.2 KB)
- roc_curve_bagging.png (177.9 KB)
- roc_curve_decision_tree.png (202.5 KB)
- roc_curve_dummy_baseline.png (226.1 KB)
- roc_curve_gradient_boosting.png (184.6 KB)
... and 10 more files
================================================================================
======= REPRODUCIBILITY & ARTIFACTS COMPLETE =======
================================================================================
ALL DELIVERABLES READY
NEXT STEPS:
1. Review artifacts/FINAL_SUMMARY_REPORT.txt
2. Export notebook to PDF using one of the methods above
3. Package all files (notebook + models/ + artifacts/ + Output/) for submission
PROJECT COMPLETE!
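One closing check worth running before submission: confirm that a dumped model reloads with identical behavior. A sketch on synthetic data — the toy data and `model_demo.joblib` file name are illustrative; the notebook's actual models live under models/:

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Tiny synthetic stand-in for the real train/test split.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(random_state=42).fit(X, y)
joblib.dump(model, 'model_demo.joblib')

# A reloaded model must reproduce the exact same probabilities.
reloaded = joblib.load('model_demo.joblib')
assert np.array_equal(model.predict_proba(X), reloaded.predict_proba(X))
print('round-trip OK')
```

Applied to `models/best_model.joblib` and the saved test split, the same assertion would confirm the reported 0.9514 test AUC is recoverable from the artifacts alone.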